Amino acid substitution matrices from an information theoretic perspective

Amino acid substitution matrices from an information theoretic perspective

J. Mol. Riol. (1991) 219, 555-565 Amino Acid Substitution Matrices from an Information Theoretic Perspective Stephen F. Altschul National Center for...

1MB Sizes 0 Downloads 146 Views

J. Mol. Riol. (1991) 219, 555-565

Amino Acid Substitution Matrices from an Information Theoretic Perspective Stephen F. Altschul National

Center for Biotechnology Information National Library of Medicine National Institutes oj Health Bethesda, MD 20894, [J.S.A.

(Received 1 October 1990; accepted 12 February

1991)

Protein sequence alignments have become an important tool for molecular biologists. Local alignments are frequently constructed with the aid of a “substitution score matrix“ that specifies a score for aligning each pair of amino acid residues. Over the years, many different substitution matrices have been proposed, based on a. wide variety of rationales. Statistical results, however, demonstrate that any such matrix is implicitly a “log-odds” matrix. with a specific target distribution for aligned pairs of amino acid residues. In the light of information theory, it is possible to express the scoresof a substitution matrix in bits and to see that different matrices are better adapted to different purposes. The most widely used matrix for protein sequence comparison has been the PAM-250 matrix. It is argued that for database searches the PAM-120 matrix generally is more appropriate, while for comparing two specific proteins with suspected homology the PAM-200 matrix is indicated. Examples discussedinclude the lipocalins, human cr,B-glycoprotein, the cystic fibrosis transmembrane conductance regulator and the globins. K~yumd~: homology; sequence comparison; statistical significance; alignment, algorithms: pattern recognition

1. Introduction General methods for protein sequencecomparison were introduced to molecular biology 20 years ago and have since gained widespread use. Most early attempts t’o measure protein sequence similarity focused on global sequence alignments, in which every residue of the two sequencescompared had to participate (Needleman & Wunsch, 1970; Sellers, 1974; Sankoff 8r. Kruskal, 1983). However, because distantly related proteins may share only isolated regions of similarity, e.g. in the vicinity of an active site, attention has shifted to local as opposed to global sequence similarity measures. The basic idea is to consider only relatively conserved subsequences;dissimilar regions do not contribute to or subtract from the measure of similarity. Local similarity may be studied in a variety of ways. These include measures based on the longest matching segmems of two sequenceswith a specified number or proportion of mismatches (Arratia et aE., 1986; Arratia & Waterman, 1989), as well as methods that compare all segments of a fixed, predefined “window“ length (McLachlan, 1971). The most common practice, however, is to consider segments of all lengt,hs. and choose those that optimize a

similarity measure (Smith & Waterman. 1981; Goad & Kanehisa, 1982; Sellers, 1984). This has the advantage of placing no a priori restrictions on the length of the local alignments sought. Most database search methods have been based on such local alignments (Lipman & Pearson, 1985; Pearson & Lipman, 1988; Altschul et al., 1990). To evaluate local alignments, scoresgenerally are assigned to each aligned pair of residues (the set of such scoresis called a substitution matrix), as well as t,o residues aligned with nulls; the score of the overall alignment is then taken t,o be the sum of t,hese scores. Specifying an appropriate amino acid substitution matrix is central to protein comparison methods and much effort has been devoted to defining, analyzing and refining such matrices (McLachlan, 1971; Dayhoff et al., 1978; Schwartz & Dayhoff, 1978; Feng et aE., 1985; Rao, 1987; Risler et ml.. 1988). One hope has been to lind a matrix best adapted to dist.inguishing distant evolutionary relationships from chance similarities. Recent mathematical results (Karlin &, Alt,schul, 1990; Karlin et al., 1990) allow all substitution matrices to be viewed in a common light, and provide a rationale for selecting particular sets of “optimal” scores for local protein sequence comparison.

2. The Statistical Significance Sequence Alignments

of Local

Global alignment,s are of essentially no use unless they can allow gaps, but this is not, true for local alignments. The ability to choose segments with arbitrary starting positions in each sequence means that biologically significant regions frequently may be aligned without t.he need to introduce gaps. While, in general, it, is desirable to allow gaps in loc~~l alignments. doing so greatly decreases their tnat.hematical tractability. The results described here apply rigorously only to local alignments t,hat lack gaps, i.e. to segments of equal length from each of t,he two sequences compared. Some recent dat.abase search tools have focused on finding such aligtlmerits (Altsrhul & Lipman, 1990: Altsehul of ~1.. 1990). However, the statistics of optimal scores for local alignments that include gaps (Smit,h Pf a/.. 1985; Waterman rt al.. 1987) are broadly analogous t.o those for the no-gap case (Karlin & Altsehul, 1990; Karlin et al.. 1990). where more precise result,s are available. Therefore. one may hope that many of the basic ideas presented below will generalize to local alignments t)hat include gaps. Formally, we assume that the aligned amino acids ai and aj are assigned the substitution score ,sij. Given two prot’ein sequences, the pair of equal Ietlgth segment’s that, when aligned. have the greatest aggregate score we call the Maxima1 Segment, Pair (MSF’?). An MSP may he of ati! length: it,s score is the ,MSF-’ score. Since any two protein sequences, related or utlrelated. will have some MSP score, it, is important t,o know how great a score one can expect to find simply hy chance. To address t,his question one needs some model of chance. The simplest is to assume that in the two proteins compared. t,he arnino acid ai appears randomly with the prob ability pi. These probabilities are chosen to reflect t,he observed frequencies of the amino acids in actual proteins. For simplicity of discussion we will assutne both proteins share the same amino acid probability distribution; more generally. one C&an allow them to have different distributions. A random protein sequence is simply one c*onstructed according to this model. For the sake of the statistical theory, we need to make t’wo crucial but reasonable assumptions about the substitution scores. The first is that, there he at least one positive score and the second is that the expected score xi, j yipjsij be negative. Because we permit the length of a segment pair to be adjusted to optimize its score, both these assumptions are necessary also from a practical perspective. If there were no positive scores. the MSP would always consist of a single pair of residues (or none at all, if this were permitted), and such an alignment is not of interest. If the expected score for two random residues were positive, extending a segment pair as t Ig.

Ahbt~eviations in~tnutro~iobulin.

trsrcl:

MSI’.

Jlasitnal

Segtnr~nt

I’ait,:

far as possible would always tend to increase it’s score: this violates the idea of seeking locaal aligntnents. Substitution mat,rices used in other caotrt.exls, such as global alignments (Seedleman & \Vutrsc*h. 1970) or local a1ignmrnt.s using windows (McLachlan. 1971), need not, sat,isf:v t hesr constraints. However. unless otherwise stat,ed. it will be assumed below that, any substitution matrix sa,tisfies the two conditions described. The stat,istical theory of LVMsT’ scores (Karlin KAltschul, 1990; Karlin rt al.. 1990) involves a kr~ parameter 2. which is the unique posit’ivr solutiotr to the equation:

Not,ice that multiplying all the s(aores of a sul)stit\ltion matrix hv some positive constant tloeh tlot effect t)he relative scores of any subalignments. T\j.o tnat,rices relat,ed hv such a factor can. therefore. I)ta considered essent\all,v equivalent. Inspe(+ion 01’ equation (1) reveals that multiplying all scores 1)~ I! also has t.he rfiect of dividing 2 by n. The parameter I. ma\-. therefore. he viewed as a natural sc*alr fi)t any scoring system: its deeper meaning will he discussed helow. Given two random prot,ein srq~tww’s as dcw7iltcvl how many distinct. or “locally opt imill“ above. (Sellers. 1984) MSPs with score at least S at’( expected to occur simple I)>: chwncae! This nutnt)csr is well >ipproximatcld 1)~ the formula: h’s t’ 2s (L’) where X is t)he product of the sequen~s‘ lengths. and h’ is an explicitly calculable parameter (liilrlitl B Altsetiul. 1990: Karlin cf r/l.. 1990). \Vtrelr comparing a single random sequent with aII t ht. sequences in a database. setting S to the procluc.1 ot the query sequence length and the database lrngt h (in residues) yields an upper hound on the trutnher of distincut MSPs with s(aore at least S that thta search is expected to yield.

3. Optimal Substitution Matrices for Local Sequence Alignment Formula (2) allows us to tell when A segment [jail, has a significantly high score. However. ii does Ilot assist in choosing it11 appropriate slll)st it nt ion tnat,rix in the first> place. A se~ontl c~lass of’ results. however. ha,s direct bearing on this question. ‘I’hc~st~ state that among YIST’s from t,he c*otnparison of random sequences, the amino acids (li ;rIrd f/j :LI’CL aligned with frequencay approaching rlij = //i/$ eAazl (Arratia rt al., 1988; Karlin & Altschul, 1990: Karlill ut al.. 1990: Demho & Karlin. 1991). Given any substitution matrix and random f)rot,ein model, one may ea,sily c~alculatr the set ot target frequencies. qij. just described. Notice that by the definition of 1 in equation (1). these target frequencies sum to 1. Now among alignments repsent’ing distant, homologies. the atnino ac+ls are

Amino

Acid

Substitution.

paired with certain characteristic frequencies. Onl) if these correspond t,o a matrix’s target frequencies. it has been argued, can the matrix be optimal for distinguishing distant local homologies from simiIaritiei due to chance (Karlin & Altschul. 1990). Any subst.itution matrix has an implicit set of t,a.rget frequen(Gesfor aligned amino acids. ll’riting the scores of the matrix in terms of its target frequencies. one has: sij =

( ) 1nEYij

i I A.

(:s)

Tn other words. the score for an amino acid pair can be written as the logarithm to some base of that pair’s target frequency divided by the background frequrnc*y with \vhich the pair occurs. Such a ratio (~ornparrs the probability of an event occ*urring under two alternative hvpotheses and is called a likelihood or odds rat/o. Scores that are the logarithm of odds ratios are called log-odds scores. Adding such scorescan be thought of as multiplying the corresponding probabilities, which is appropriate fc)r independent events, so that t’he total score remains a log-odds score. Log-odds matrices have been advocated in a number of contexts. (Dayhoff et al., 1978; Gribskov rl al., 1987; Storm0 & HartzelI. 1989). The wideIS used PAM matrices (Dayhoff et al., 1978), for instance? are explicitly CJft,his form. Other substitution mat’rices, though based on a wide variety of rationales. are all log-odds matrices, but with itnplicitj rather than explicit target frequencies. Therefore. while one may criticize the method described by Dayhoff et al. for estimating appropriate target frequencies (Wilbur, 1985), the most direct way t,o derive superior matrices appears to be through the refined estimation of amino acid pair target and background frequencies rather than through arry fundamentally different approach.

4. Substitution

Matrices

for Global

Alignments

While we have been considering substitution matrices in the context of local sequence comparison, they may be employed for global alignment as well (Xeedleman & Wunsch, 1970; Sellers, 1974; Shwartz & Dayhoff, 1978). There is a fundamental difference. however, between the use of such trrat,rices in these two (sontexts. For global alignments, as previously, multiplying all scores by a fixed positive number has no effect on the relative scores of different alignments. But adding a fixed cluantit,y (I to the score for aligning any pair of residues (and (l/2 to the score for aligning a residue with a null) likewise has no effect. Scoring systems I hat may be t,ransformed into one another by means of t,hese two rules are, for all practical purposes, equivalent llnfortunately, the new transformation tneans that no unique log-odds interpretation of global substitution mat,rices is possible, and it is

MatrGxs

557

doubtful that any “target distribution” theorem can be proved. It may be possible to make a convincing casefor a particular substitution matrix in the global alignment context, but the argument will most likely have to be different from that for local alignments (Karlin & Altschul. 1990). The sa,me applies t,o substitution matrices used with fixed-length windows for studying local similarities (SlcLachlan, 1971; Argos, 1987; Storm0 & Hartzell, 19X9): a fixed quantity can be added to all entries of such a matrix with no essential effect,. It is notable that while the PAM matrices were developed original]\- fi)r global sequence comparison (Dayhoff et al.. 19?8), their statistical theory has blossomed in the 1o(aalalignment context.

5. Local Alignment Scores as Measures of Information .Multiplying a substitution matrix by a constant changes 2 but does not alter the matrix’s implicit t,arget, frequencies. By appropriate scaling, one may therefore select the parameter I at will. Writing the matrix in log-odds form, such scaling corresponds merely to using a different implicit base for the logarithm. One natural choice for 1 is I, so that all scores become natural logarithms. Perhaps more appealing is to choose A= In 2 ;2 0.693, so that the base for the log-odds matrix becomes2. This lends a particularly intuitive appeal to formula (2). Setting the expected number of MSPs with score at least 8 equal to p. and solving for 8, one finds:

For typical substitution matrices, K is found to be near 0.1. and an alignment may be considered significant when p is @05. Therefore the right-hand side of equation (4) generally is dominated by the term log, ,I;. In other words, the score needed to distinguish an NSP from chance is approximatjely the number of bit,s needed to specif? where the MSP starts in each of the two sequencesbeing compared. (one bit can be thought of as the answer t,o a single yes-no question; it is the amount of information needed to distinguish between 2 possibilities. It’ becomes apparent that, in general. log, N bits of information are needed to distinguish among S possibilities.) For comparing two prot’eins of lengt,h 250 amino arid residues, about 16 bits of information are required; for comparing one such protein to a sequence database containing 4.000,OOO residues, about 30 bits are needed. When cast in t,his light, alignment scores are not arbit’rary numbers. By appropriate scaling (multiplying by 1/0.693) they take on the units of bits, and rough significance calculations can be performed in one’s head. Furthermore, when so normalized, different amino acid substitution matrices ma! be directly compared.

6. The Relative Substitution

Entropy Matrix

t,ein evolution. The amount of evolutionary c.hangr that yields. on average, one substitution in 100 amino acid residues they (*alled one PAM. [‘sing their model. one may easily c~alculatr the frequent> with which any two amino acid residues are paired in an accurate alignment of two homologous I)roteins that have diverged by any given atnou~tt of’ evolutionary change. These target frrquenc*irs ma>t,hen be used to construct log-odds matrictes and. it1 particular. the widely used PAM-2.W mat ris. DayhofI rf crl. ( 1978) originally proposed this mat ris for the global alignment of two sequencaes suspt~.t~l to he homologous, but it has sinc.6, I)ocn used to search prot,ein databases for loc*al alignmrnts to ;I query sequence (Lipman & Pearson. 1985: l’earsoii Kr Llpman. 198X). One may therefore itlquircb whether 260 I’AMs yield reasonable target frt~qnelrties for database searches. Assuming the model described by I)aytiofi rt trl. (1978), Table 1 lists the relat.ire entropy II implicit in a range of PAM matrices. As argued above,. distinguishing an alignment from c~hanc~ein it search of a typical current protein database using an average length protein requires about 30 I)it s of information. Accordingly. for an alignmt~nt of’ segments separated t)y a givr>n PA,11 distancsf>. otirl (aan cxalculat,e the minimum length necflssary to rise above background noise: t,hrse lengths are rr~c~ot~drtl in Table 1 For instance. at a distancr of’250 I’il\Ms. on average only 0.36 hit of information is available per alignment position. To bp statisticaall!, signiti(*ant , suc.h an aligntnrtrt would need to lrav(l :I length grrat,rr than about X3 residues. Many biologically intrre&ng regions of protein sitnilarity ill’? much short.er than this, and ac~c~ordingly nck(atl a stronger signal to he detected. A loc~~l alignment of length 20 residues will nwtf about 1.;i bits t)et’ ;Lligtltnent posit,ion, while one of trngt.h 10 residuc~s will nerd about 0.75 bit. Table 1 shows that such aliglrnients will not bc tletectabk~ if their c*otist it lrfwt

of a

The above review of previous results has provided us with the necessary tools for the analysis that follows. The ultimate goal is to decide which substitution matrices are the most appropriate for database searching and for detailed pairwise sequence comparison. Given a random protein model and a substitution matrix, one may calculate the target frequencies qij characteristic of the alignments for which the matrix is optimized. A useful quantity to consider is the average score (information) per residue pair in these alignments. Assuming the substit,ution matrix is normalized as described above, this value is simply: H = 1 qijsij

i,j

= 1 qij

log, ;;;.

i.j

Notice that H depends both on the substitution matrix and on the random protein model. In information theoretic terms, H is the relative entropy of the target and background distributions. The origin of the name need not be of concern. The important point is that, for an alignment, characterized by the target frequencies qij% FI measures the to average information available per position distinguish the alignment’ from chance. Tntuitivrly. t,he higher the value of the relative entropy of target and background distributions, the more easily the\ are distinguished. For a high value of H, relativei! short alignments with the target distribution ran be distinguished from chance. while. if the value of II is lower, longer alignments are necessary. It is interesting to examine the PAM model of molecular evolution (Dayhoff rt al.. 1978) from this standpoint. From a study of mutations between a large number of closely related proteins. I)aThoff and co-workers proposed a stocha,stic model of pro-

Table 1 The

N (bits) 417 3.43 2.95 2.57 2526 1fKl 1.79 1fro 1.44 I.30 1.18 1a8 0.98 090 ow 0.76: 0~70 o%L?

relative

entropy

Min. signifirant length (30 bits) x 9 II I2 14 15 I7 19 21 24 26 18 31 34 37 40 43 47

H

of

PAM

PAM distance

matricrs

II

(bits)

Min. significant length (30 bits)

Amino

Acid

Substitution

559

Matrices

Table 2 The average score (in bits) per alignment position, when using given PAM matrices to compare segments in fact separated by a variety of PAM distances I’AH matrix ,I! rmplo~rd

segments have diverged 150 PAMs. respectively.

7. PAM

40

by more than

Actual 110

80

about

75 and

Matrices for Database Searching Two-sequence Comparison

and

The relative entropy associated with a specifics PAM distance indicates how much information per position is optimally available. For a given alignment. one can attain such a score only by using the appropriate PAM matrix, but, of course. before the alignment is found it will not be known which matrix that is. It has therefore been proposed that a variety of PAM matrices be used for database searches (Collins et al., 1988). We seek here to analyze how many such matrices are necessary, and which should be used. Suppose one uses a matrix optimized for PAM distance M to compare two homologous protein segment,s that are actually separated by PAM distance D. For a range of values of M and D, the average score achieved per alignment position is shown in Table 2. Notice that for any given matrix ,I!, the smaller the actual distance D, the higher the score. On the other hand, for a specific distance D, the highest score corresponds to the matrix with PAM distance M = D; this score is just the relative entropy discussed above. Using a PAM matrix with Jl near D, however, can yield a near-optimal score.

Table 3 Ranges of local alignment PA J! matrices

I’.W matrix 40 X0 I20 I60 200 “40 “X0 320 Xi0

93 9, efficiency range for database searching (30 hits) !)to 13 to 19 to 26 to 36 to 47 to 60 to 7.5 to 94 to

PI 34 50 70 94 113 155 192 233

lengths for which are appropriate

various

87 y. efficiency range for I-sequence comparison (16 hits) 4 6 9 12 Iti PI 27 34 42

to to to to to

14 12 33 46 62 to 80 to 101 to Id4 to 149

PAM

distance 160

D of segmrnts 200 240

1x0

320

For example, the relative entropy for D = 160 is 0.70 bit, but any PAM matrix in the range 120 to 200 yields at least @67 bit per position. In practice, how near the optimal is it important to be? As argued above, for a given PAM distance t,here is a critical length at which alignments are just distinguishable from chance in a typical current database search; these lengths are recorded in Table 1. For the sake of analysis, we will assume that it is worth performing an extra search (using a different PAM matrix) only if it is able to increase the score for such a critical alignment by about two bits, corresponding to a factor of 4 in significance. Since a critical alignment has about 30 bits of information, we will therefore be satisfied using a PAM matrix that yields a score greater than 93% of t,he optimal achievable. Using data such as those shown in Table 2. one can calculate for which PAM distances D (and thus for which critical lengths) a given matrix M is appropriate; the results are recorded in Table 3. Our experience has shown that perhaps the most typical lengths for distant local alignments are those for which the PAM-120 matrix gives near-optimal scores, i.e. lengths 19 to 50 residues. Therefore, if one wishes to use a single standard matrix for database searches, the PAM-120 matrix (Table 4) is a reasonable choice. This matrix may, however, miss short but strong or long but weak similarities that contain sufficient information t’o be found. Accordingly, Table 3 shows that t,o complement the PAM-120 matrix, the PAM-40 and PAM-240 (or traditional PAM-250) matrices can be used. Additional matrices should improve the detection of distant similarities only marginally (i.e. raise their s(sores by at most 2 bits). If, rather than searching a database with a query sequence, one wishes to compare two specific seyuences for which one already has evidence of relatedness, the background noise is greatly decreased. As discussed above, for two proteins of fypical length, about 16 bits are needed to chstinguish a local alignment from chance. Accordingly, applying the same crit,eria as before, a matrix should be considered adequate for those PAM distances at which it yields an average score within 87’& of the optimal. In Table 3, we list the range of crit,ical lengths over which various PAM

560

S. F.

Altschul

Table 4

The PAM-120

,matrix

with

scores

in

hnlf

hits

6 -I -3 -4

-2 -4 2 -1 -4 -1 -2 -7 -4 0

-6 - 3 K

matrices are appropriate for detailed pairwise sequence comparison. As a single matrix, thf> PAM-200 spans the most typical range of local alignment lengths, i.e. 16 to 62 residues. Alternatively, if two different’ matrices are to be used, the PAM-80 and PAM-250, which together span alignment lengths 6 to 85 residues, or t,hr PAM-120 and PAM-320 matrices, which span lengths 9 to 124 residues, appear to be appropriate pairs. Since it is convenient to express substitution matrices as integers, and since a probability factor of 2 between score levels is too rough, the units for the PAM-120 matrix shown in Table 4 are half bits. The scores in the original PAM-250 matrix (Dayhoff et al., 1978) were scaled as 10 x log,,. Because lO/(ln 10) z 3/(ln 2) to within 0.4%; a unit’ score in that matrix can be thought of as approximately one-third of a bit.

8. Biological

and PAM-120 scores for MSPs representing distant relationships to four different query sequences. In all cases,we consider relationships near the limits of what can be distinguished from chance in a search of the PIR protein sequencedatabase (Release 26.0: 7,348,950 residues). It will be noticed that the highest chance PAM-250 scores are consistently slightly smaller than the highest chance PAM-120 scores. This is primarily attributable to the fact that the parameter K discussedabove is about half as large for the former scores as for the latter. Furthermore. since neither the PIR database nor a given query sequence ever precisely fits the random protein model described by Dayhoff et al. (1978), t,hcl parameter 1 varies slightly from one comparison to another. Therefore, while we will treat the PAM-120 scores from Table 4 as half bits, and t,he PAM-250 scores of Dayhoff et al. (1978) as one-third bits, it should be noted that’ this is always a slight approximation.

Examples

As discussed, the part,icular PAM matrix that best distinguishes distant homologies from chance similarities found in a database search depends on the nature of the homologies present, and this cannot be known a priori. However, it is frequent.]) the case that distantly related proteins will share isolated stretches of relatively conserved amino acid residues, corresponding to active sites or other important structural features. It has been observed that in general the mutations along genescoding for proteins are not Poisson-distributed (Uzzell & Corbin, 1971; Holmquist et al., 1983). suggesting that short, conserved regions are to be expect,ed. As shown in Table 3, this means that the widely used PAM-250 matrix generally will not be optimal for locating distant relationships. In the examples below, we compare t,he PAM-250

We used t’hr BLAST program (Altschul P! (xl.. 1990) to search the PIR database with human apolipoprotein D precursor (PTR code LPHCD; I)rayna it al.. 1987). using both t)he PAM-250 (Dayhoff rf rtl.. 1978) and PAM-120 (Table 4) substitution matri(~(~s. Human apolipoprotein I> precursor is a 1X9 rrsiduc, that k)elfJngS to tilt, ~ipof~dill glycoprotein (a,-microglobulin) superfamily. which contains proteins that exhibit a wide range of functions rc~lated t,o their abilit,y to bind small hydrophobic ligands. The similarities among these proteins and their biological roles have been analyzed (Peitsch & Boguski. 1990). and crystal st,ructures are available for several members of the superfamily ((‘owan rt (11.. 1990). Three proteins in the superfamily are rat androgen-dependent epididymal protein (I’IK, c~tle

Amino

Acid

Substitution Matrices

561

Table 5 Three MSPs representing sequence database (release

distant relationships. from searches of the PI R protein 26.0) with human apolipoprotein D precursor (PIR codp LPHUD) Optimal

PIR

code

Optimal

PAM-250

PAM-250

score

alignment

(bits)

Optimal

PAM-120

score

(bits)

LPHUD

25

LGKCPNPPVQENFDVNKYLGRWYEI

49

SQRTAD

12

LAAGTEGAWKDFDISKFLGFWYEI

36

27.0

33.5

A32202

27

HDTVQPNFQQDKFLGRWY

44

25.7

33.5

28

NIQVQENFNISRIYGKWYNL

47

23.0

30.5

27.0

29.0

SO0758

SO0758

HCHU

chance

Highest PIR

code

of

alignment

sequence

score:

involved:

I,I’H[T[). h,~man apolipoprotrin rat protein prcwwwr: A31202, irrtrlr-x-trypsin inhibitor precursor:

1) precursor: SQRTAI), rat androgen-dependent prostaglandin-D SOO758. human

SQRTAI); Brooks et a,l.. 1986), rat prostaglandin-I) synthase (PTR code A32202; Grade et 01.. 1989) and human n,-microglobulin (PIR code HC’Hl’: Kaumeyrr et c/l.. 1986). The secaondof these has onl! recently been recognized as a member of the supe~‘~ family (M. S. Boguski & JI. (1. T’eitsch. personal csommunication): it is the first such member with known caatalytic activity (1:rade et al.. 19X9). ITsing PAM-150 s~)res. t,hc maximal segment pair for each of these sequences when c*ompared to LPHI:I) is shown in Table 5. These local similarities c~orrespondto one of two motifs that are conserved throughout th(b superfamily (Koguski B States. 1990). The s(aoresfor the three alignments are 27.0. 25.7 and 23.0 bits. rrspec%ively. However. the highest score from a protein in t,hr database unrelated to I,PHI’l> is 27,O bits, involving human surface glvcolxotc~in

(‘I11 fi

lxe’c’ursor

(T’TR,

(*ode

SO&FiX:

Simmons & Shed. 198X). The PAM-250 matrix therefore fails to srparat,e the homolngous alignrnrnts shown f’rom background noise. In contrast rising fhe PAX-1 20 matrix of Table 4. the s(*oresfor the three alignm(ants jump to 33.5. 33.5 and 30.5 bits, respectively. (The 1st 7 alignment positions for I,PHUI)-SQIITAI) shown in Table 5 arc dropped in an optimal PAM-1 PO alignmrnt. as art’ the 1st 3 positions for the LPHL’I)-A32202 alignment.) This raises their scores above that of the best chance PAM-120 alignment (%!I+ bits). again involving hutnirn surfacargiJ-coprotc~in(‘I)16 precursor. Soticth that. in both (xst’s the estimate that about JO bits art’ nc~edetl &arty to distinguish an MST’ from cahancr is valid. For this query sequence. no relationship is found using the PAM-250 matrix that is missed Iby the PAM-120.

epididvmal I%5 li

synthase: HC’HY, human a,-mieroglohulin, surface glycoprotrin (‘DIG ~WCU~SO~

(b) Human

cc,K-glycoprotritl

We searched the PIR database with human cx,B-glycoprotein (PIR code OMHrl K: lshioka et al., 1986), a plasma glycoprotein of unknown function, and a member of the immunoglobulin superfamily. [‘sing the PAM-250 matrix, the only protein in the database with an MSP that rises above background noise is pig PO%F protein (PIR code J’l,OO30; Van dc Weghe it al.. 19X8). which acshieves a score of 32.3 bits. As shown in Table 6. the score for this known homology (Van de 1Vcghe rt rrl.. 19x8) rises to 15.0 bits when the PAM-120 matrix is used instead. In addition, two proteins with imnlunoglobulitl domains. kinasr-related transforming protein precursor (PIR code SOO17-t:Qin et al., 1988) and human Ig K chain precursor V-III region (PIR code KSHUVH; Pech &, Zachau. 1984), achieve scores of 29.0 and 28.5 bit,s, respectively. Table 6 illustrates that both these similarities are only just distinguishable from chancr. and that using the PAM-250 matrix both similarities drop in score by at least, four bits. (c) The rystic jibrosis transmambmnr conductance regulator

The cause of cystic fibrosis has been traced to mutations in a protein that bears striking similarity to ma,ny proteins involved in the transport of substances across the cell membrane (PIR code A30300: Riordan et al.. 1989). Characteristic features of the protein are two nucleotide (ATP)binding folds (Higgins et al., 1986). When the PIR databa,se is searched with A30300, many related

562

S.

I*‘. Altschul

Table ‘I’h,rPY

MSPs

represen,ting

6

distant relationships, frown searches of the PI Ed protein sequence 260) with human a ,R-glycoprotrin (PIR code OMH C.TlH) Optimal

PIR

code

Optimal

PAM-250

1 AIFYETQPSLWAESESLLKP

PLO030

1 ALFLDPPPNLWAEAQSLLEPWANVTLTSQS

OMHUlB

171

score

alignment

OMHlJlB

PAM-250

LANVTLTCQA

dutabasr

(~1~u.s~~

Optimal

(bits)

PAM-1201

score

(bits)

30 30

LSEPSATVTIEELAAPPPPVLMHHGESSQVLHPGNKVTLTCVAPLS

32.3

45.0

216

so0474

18 LRGQTATSQPSASPGEPSPPSIHPAQSELIVEAGDTLSLTCIDP

61

25.0

29.0

K3HUVH

15

48

22.0

28.5

27.0

28.0

540102

WGSMHH

LPDTTREIVMTQSPPTLSLSPGERVTLSC-SQS Highest PIR

chance code

of

alignment sequence

score: involved:

OMHIT113, human a, U-glycoprotrin: I’IAMM~. pig I’oB F protein: SOO474. kirtasr~relatrti trsnsf’orming KSHI’VH. human Tg h’ chain precursor V-III rrgion (Vh): .JQOlOZ, eggplant mosaic virus RNA wplicasr \V(XvIHH. Streptomyws hygromyin I< phosphotransferase (Zalac;tin PI (~1.. 19%).

proteins may be identified easily using either t,hc PAM-250 or the PAM-120 substitution matrix. However, several distant’ relationships present are harder to detect. In Table 7 are shown four optimal PAM-250 alignments, representing homologies to each of the two A30300 nucleotide-binding folds. Xone of these alignments has a PAM-250 score as great as the highest chance score of 31.3 bits. In contrast. when the PAM-120 matrix is used, the

Fou,r

MSPs

repreCsenting 26.0) with

code

Table 7 relationships, from searches qf th,e PI R protein sequence database Jibrosis transmembrane conductance regulator (PI R code A30300)

cystic

Optimal

PAM-250

438

SO5328

18

BVECUA

11 THNLKNINLVIPRDKLIWTGLSGSGKSSL 1219

VSKDINLEIQDGEFVVE'VGPSGCGKSTLLRMIAGLETVTSGDL

YTEGGNAILENISFSISPGQRVGLLGRTGSGKSTLLSAFLRLLNTEGEI

QRECFH

19 FRVF'GRTLLHPLSLTFPAGKVTGLIGHNGSGKSTLLKMLGR

QREBOT

31

DGDVTAVNDLNFTLRAGETLGIVGESGSGKSQSP.LPxLMGLLATNGRI Highest code

chance of

alignment sequence

PAM-250

score

alignment

TPVLKDINFKIERGQLLAVAGSTGAGKTSLLMMIMGELEPSEGKI

PIR

prrcwsor. r,/ 01. I!M).

distant

A30300

A30300

(kit)

alignments jump in score by 4 to almost I2 bits, giving all but one a score greater than the highest chance PAM-120 score of 334 bits. (The boundaries of the optimal alignments change slightly under the alternate scoring scheme.) So biologically significant similarity is distinguished by the PAM-250 matrix that is not found using the PAM-120. The relatively high chance scores found in this example are partly attributable to the length of the query

Optimal PIR

protrin (Osorio-Kww

score: Involved:

(bits)

(r~icas~

Optimal

PAM-120

score

(bits)

482 60

28.3

40.0

40

24.7

35.0

59

29.3

35.0

77

28.3

32.5

31.3

33.0

A34416

A32916

1267

~W~300. ('ystic tibrosis tratlslnrlrlhrant. conductanct~ rrpulator: S(l5:128. li:rrtrrohfwlrr clrroyrnr* inlwr nwrnhrir~w protritl malli (I)ilhl r,/ I//.. 1989); I3VE(‘VA. .bWwrichia wli uvr.4 protrilr (Husitin rf ~1.. 1986); QKM‘FH. f’elric.hrolnr~i~otl transport protrlu fhu(’ (( ‘oultou P/ trl.. 1987): QREBO'I', oligoprptide pwmease membrane protrin oppl) (Higgins rf trl.. I%%); .-U111(i. fukr h~~iro*~tllrtl~~lylnt~~~~I (‘o,\ wductasr (NAI)PH) (Rajkovic pt (I/.. 1989): A329lti. I’ihrio hanv~y’ acyl-protein synthctasr (fragmt.nt) (Johnston rl crl.. I!#!,)

Amino Acid Substitution GPVF

49

SO6134

GPVF SO6134

SAGWDSPKLGAHAEKVFG~DSAVQL~TGEWLDGKDGSISIQ S AHA”

94 I.

61

ASQLRSSRQMQAHAIRVSSIMSEYIEELDSDILPEL~T~THDL

95

KGVLDPHFWVXEALLKTIKEASGDKWSEELSRAWEVAyDGLATAI L ” H G

107

L

H 106

140 w

A

NKVGPAHYDLFAKVLMEALQAELGSDFNQKTRDSWAKAFSIVQAVL

152

Figure 1. Thr PAAM-% maximal segment l)air of broatl bran leghrmoglobin I (I’TR code (:I’\.F) and sea (*uwmber hrmoglobin I (I’TR co& SO61 S-C). Identical residues are echcwd on thr wntral linr. P.\Jl-%O WOW. L5.3 bits: length. 9:! twiduw.

sequence (1480 residues), and partly to its composition, which renders the parameter 1 slightly smaller than in the previous examples. (d) Clobins It is possible to find examples of long alignments representing distant relationships that are better distinguished by the PAM-250 than by the PAM-120 matrix. In practice such examples are rare, for some of the reasons discussed above. The globins are one superfamily in which sequencedivergence has been relatively uniform over the length of entire proteins. As a result, some sequence relationships within this superfamily become apparent, only with scoring systems tailored for long but very weak alignments. For example, searching the PIR database with broad bean leghemoglobin I (PIR code GPVF; Richardson rt al., 1975), the alignment with sea cucumber hemoglobin I (PTR code S06134; Suzuki, 1989), shown in Figure 1, is found having a PAM-250 score of 253 bits. This is almost as high as the score of the best chance MSP (26.7 bits), which involves AWmonella typhimurium cystathionine fl-lyase (PTR code JVOO2fJ; Park & Stauffer, 1989). The alignment, is 92 residue pairs long; only 14 of these pairs involve identical amino acid residues, and they are spread fair1.y evenly along the alignment. This particular similarity is totally obscured when PAM-120 scores are used. The best region of the alignment shown then involves residues 100 to 133 of the leghemoglobin sequence and has a score of only 13 bits, while the best chance PAM-120 alignment, involving mouse hepatitis virus El membrane glycoprotein (PTR code VGIHE I : Armstrong rt al.. 1984), scores 27.5 bits. Nevertheless, as in the previous examples, a number of relationships are distinguished by the PALM-l 20 matris but m&Id by the PAM-250.

9. Conclusion This paper has analyzed the properties of amino acid substitution matrices in the context of local alignments

lacking

gaps. This

is exactly

the sort of

alignment sought by the recently developed BLAST database search programs (Altschul et al., 1990: Altschul

& Liptn;tn.

1990). WV have concluded

that

Matrices

563

for protein databases of typical current size (about 1 x 10’ residues), t,he most broadly sensitive substitution matrix should be a log-odds matrix wit,h relative entropy of about one bit, e.g. t,he PAM-120 matrix. In order to detect short but strong homologies or long but weak ones, this matrix can be complemented by the PAM-40 and PAM-250 matrices; additional matrices should be of only marginal utility. Of course, many database search methods, such as the FASTA programs (Lipman h Pearson, 1985; Pearson & Lipman, 1988). seek locaal alignments with gaps, and such measuresare potrntially more sensitive to distant homologies. Unfortunately. if gaps wit,h associated scores are allowecl, the specific quantitative discussionabove is no longer correct. Nevertheless, the general thrust of the arguments should still apply, and theory and experiment suggest that analogous results will hold for local alignments with gaps (Smith rt al.. 19X5; U’at,erman et al.. 1987; Collins et al.. 198X). There are. of course, many much more involved ways for assessing local alignment than those discussedhere. Scores can be assigned to aligned dresidues or t)ri-residues: they can depend on alignment length (Altschul 8t Erikson. 1986): or they can be complex combinations of various scoring methods (Argos, 1987). Protein databases may also be searcshed with position-dependent scores or “profiles” constructed from multiple alignments (Taylor, 1986; Gribskov et al., 1987: Patthy, 1987). In certain contexts such systems may well be morp sensitive than the straightforward local scoring system considered here. Two advantages of simple additive scores are their amenability to powerf’ul algorithmic methods (Altschul et al.. 1990) and to rigorous statistical analysis (Karlin $ Altschul. 1990: Karlin et al., 1990). Such analysis ma? also yield insight into the properties of more c~ompllc*atet-l scoring scahemes. The author thanks Urs David Lipmatt. Mark Hoguski and Andrew MrlAachlan for hrlpfitl vonvrrsxtions and suggrstions on the manuscript.

References Altsc*hul, 8. F. B Erickson, R. WT. (1986). A nonlinear measure of subalignment similarit\ and its significance Irrels. Bull. Math. Bid. 48. Kl7-632. Altsc~hul.S. F. & Lipman. D. ,J. (1990). Protein database searcahes for multiple alipnments. I’roc. ,Vtrt. Acad. Sri.. 1 ‘.S.il 87. 5.509-MIS. Altsc.hul. S. F.. (iish. IV.. Millrr. \I’.. Jlyrrs. E. \\‘. & f,ipntatl. I). .J. (1990). ISasic loc~al alignnwnt sear(,h tool. ./. Mol. I1iol. 215. 403-410. A rgos. I’ (1987).A sensitivfa procrdurr to cwtnfmw amino mid srqt~encw. ,I. Mol. Bid. 193. M5-XHi. Armstrong. .I.. l;ietnann. H.. Smrekrns. S.. f2ottirr. I’. f IVarrrn. (:. (1984). Sryurnw and top~~logj. of a mottrl intrawllular mtwtbrane protrin. El pl!-win-otrin. from a (wronavirus. Snttrw /I,o~~rlon), 308. 751 -i.ir’. .-\rratia. R. & M’atprman. &I. S. (1989). Thr Ertlos-~flrnyi strong law for patt,wn matching \I-itlt a given l)roportion of mismatches. .1nr,. /‘lo// 17. 115” I Ifi!).

.\rratia.

ft.. (~xtretnr

(cordon. vafu~

I.. 8r \\‘atrrman. theory f’or srcfuen~

.\I. S. (I%%). tnatc~hity

.\tt ,I U/L

,slrrt. 14. 97 I p!N3. Arratia. I<.. 1lorris. I’. 6 \~atrrtnatr. .\I. S. (I!)XX). Stochastic* srrat)hft~: large drviatiotts for, stJcfuen(‘t+ ~5 ith scort3. ./. .-lppl. /‘rot/. 25. lOtiP 19. Boguski. JI. S. HL States. I). .I. (1990). Molrcnlar st~cfrtrnc~c~ tlatahasrs aml their IISPS. Tti I’rofri,, b~‘:,cyi~erri~!q: .-I I’rcfctiml Jpprof~ch (Rws. Li. I<.. \Yetml. I<. & Strrtiberg. 31. .I. F:.. rtls). c,tlap. 5. I KT, I’rt3s. Osfi)rtl. Brooks, I). E.. Means. A. R., Wright, E. ,I.. Singh, S. I’. &, Tiver, K. K. (1986). Molecular cloning of the vDNA for two major androgen-dependent secret~ory proteins of 1X.5 kifodaltons synthesized by the rat epididymis.

within

genes

coding

for

,J. Mol. Ewl.

proteins.

19.

X37-448. Husain. I.. (1986). protein

Van Houten, B.. Thomas, I). (:. $ Sancar. A. Sequences of Es&&&a coli uvrA gene and reveal two potential ATP binding sites. J. Biol. Phrm. 261, 4895-4901. Tshioka. S., Takahashi. X. & Putnam. F. M’. (1986). Amino acid sequence of human plasma a 1 B-gfy~oprotein: homology to the immunogfohufin supergent’ family. Proc. Nat. .4rad. Sci.. I’.S.A. 83. %63-2367. Johnston. T. C‘., Hruska. K. S. xi Adams, L. F. (1989). Tht, ttuc~frotide sequence of the ltrzli: gene of l’ibrio hnrwyi and a comparison of the amino acid sequences of the acvf-protein synthet,ases from I’. hcrwp!yi iltltf

J. Hiol. Chenr. 261, 4956-4961,

1.. ,fi.wh&

Niochrm.

13ioph~y.s.

NW.

(‘orr/n,t~t/.

163.

93

(‘OllltOtl.

~1. \\‘..

iltlti

f//u/)

into

‘ISsch~richiu

I’.

~I~lSOtl. grtiw

t’or w/i

Kr Affiltt. I). I). iron(TrI)-~~~rt~i~~l~t~ott~t~ K-12. ./. krcturiol.

(f!IHi). i///t{’ transport

169. X%-t

:w4!), (‘o\van.

S. \V.. Sr\vcwtner. JI. K. & .lottrs. ‘I’. *\. (I!)!,O). (‘t.?stalloerttf)hic. rc+inemrnt of hunlatl st’rttttt rt~tittol Itit~~fity ftrot,rin >tt 2 ,\ rwolutiott. /‘HJ/P~N.v. 8. 44 61. I)ahf. M. K.. Franc,oz. E.. Sauritr. IV.. 1300s. \\‘.. .\lattsott. >I. I). & Hof’nung. 11. (19X!)). (‘cttttf)ariscttr of’ secfurnc~rs t’ronl tfrr tttaff3 regiotts 01’ ~Sct/t~cri//~//~r /!/p/l i)t/ uric, w tmf E///rrobcl~tf~/ crrro!,c’,, vs u ith Rwhrrichicr di K I?: a potwtial tw\\ rcyrtfator~ sitts in thr, intrroprronic~ wgion. No/. (:PM. h’r~/cc,/. 218. 1!)%107.

101. Karfin. S. & Aftschul. S. F. (1990). Methods for assessing the statistical significance of molrrufar sequence featurrs by using general scsoring schemes. f’roc. Snt. Ad.

Sri..

I:.S.A.

87.

%264-226X.

Karlin.

S.. I)embo. A. & Kawabata. T. (19!W). St,atisticaf cnonlftosition of high-scoring segments from molt~c~ufar sequences. Ann. Stat. 18. 57lL581. Kaumryer. .J. F.. f’olazzi. *I. 0. & Kotick. M. I’. (1986). ‘I%> mRSA for a protrinase inhibitor related to thts H l-30 domain of inter-a-trypsin inhibitor nlso rncwdt~s a- I-microglobulin (protein HC’). Strcl. :irirl.v

Rrs. 14. 7X39-78.50. f,ipmatt. I) .I. 8 Pearson. 12’. It. (19%). ILafJicl itntl sensitivts protein similarity st~archt~s. S~irt?~‘. 227. 1435 I-l-41. MrLad~latt. .-I. I). (1971). Tests for c*otnpring rrl~ttt~d amino acid sequences. (‘ytoc~hronre c and c~yt.oc&rontr

rss,. .I. Mol. Riol. 61. -M-4%3.

T)embo, A. 8: Karfin. S. (1991). Strong limit laws of empiricsal functionals for large exceedences of’ partial sums of’ I.1.I). variables. Ann. Prob. In the press. I)ra)~tra. 1). T.. LVlcLran. .J. \V.. \Vion. K. L. Trrtrt. .I. 11.. I)rahkitl. H. .\. K: T.awtr. R. >I. (t!!Xi). Hutnatt ;tftolif,of)rotc,itt I) gene: prtttr st’cfuen~~~‘. (*hrottlosotn(~ Ioc~aliztrtiot~. ant1 l~otr~olog~ to tfrtb 2 2,-gloltrtlitl sitftf’t’f’atnifv. /)S,4, 6. I!)!)--L’O4. f~t~ttg. I).‘F.. .lohttson. M. S. & I)oolittf~~. IL F. (f!)X;l). .\ligtting iltllitto iic,id srqrtetlc~tx c~otllftarisotl of IWtrlt~~otll~ uWtf tr~etflotls. ./. .llo/. /
Higgins. (‘. I”.. Hiles. I. I).. \2’hallry. K. &C .Jamirsott. I). ,I ( f!Wi). ?r;uc*leotitfr bindittg by ti~~tt~ttriutr ~~otttf)ont~nt~ of’ bac%rrial fteriplastnic binding ftt’ot~~ili-tf~~ftt~tl(l~~ttt transport systems. E~llHO J. 4. 1033~ 103!). Higgins, C. F., Wiles. I. I).. Safmond. (:. f’.. Gill. 1). R.. I)ownie. J. A.. Evans, I. .J,, Holland. I. B.. (iray. I,.. Kurkef, S. I).. Belt, A. W. & Hermodson. .11. .I. (1986). A family of related ATP-binding subunits coupled to many distinct biological processes in bacteria. ,V:*t,wrr (London), 323, 148-150. Holmyuist. R.. Goodman, M.. Conroy. T. & Czefusniak. .I. (1983). The spatial distribution of tixed mutations

Xredtrmatr. S. K. & LVunsc*h. (‘. I). (1970). :\ general twthotf applicable to the search for similarities in thus atttitto acad sryurnws cd two proteitts. ./. Mol. Hiol. 48. -K-45:3. Osorio-Ktsrse. >I. E.. Kersc*. I’. Kr (iibhs. ,\. (f!)H9). Suc,lrot idr sequences of the grnomc~ of rggpfattt trtosaic. tymorirus. C’irdoyy, 172. 54i-.i.% Park. 1.. JI. 8 Stauffer. (4. \‘. (1989). DNA srquettc’t> of thta )net( g~c~trr~ and its flanking regions from S~lnron~llrr /yphimtrri/r/tt LT:! and homologv with the (‘orrt*sf)on~littg scyueri~~~ of Eschrrichiu di. .Ifo/. f/m. fl~nrf 216. I64 -169. I’atthy. I,. ( 1987) I)et,ec%ing hotnology of distantlyrelat.c~rf ftrotoitts with (‘onsertsus sequen,‘es. .I. .Mol. Hiol. 198. 567 -577 I’t~arsotr. \\’ Ii. Di Lipman. I). .J. (l!%X). Irnftrovt~~f t)ools fi)r biological srqurnw c~omparison. I’rw Sect. .drcff/. Sri., I ‘.S.,1. 85. Z444- 2448. f’rc~h. AI. & %ac*hau. H. (:. (1984). In~t~~nnoglol~t~fit~ grnw of difYt~rrnt sulrtgroups are intrrdigitatrd within tht> \.K lO(‘1lS. .Vcrrl. Acids IIW. 12. 9229-R%Rli. 1’tGtsc.h. 11. (‘. & Boguski. M. S. (1990). Is af)ofil)of)rotrin I) it t~~atttt~~alian bifirr-binding f)rotritt! SPW Rioloqi.st. 2. l!F?Ofi. Qiu. F f<,;t>-. l’.. I struc.turt~ ctf’c*-kit: relationship with the (‘SF- t!Pl)(:F rcBc*efttor kinase f’atnif~~~ottc,ogt,t~i~ ac*tivatiott ot’ v-kit itrvol\,c*s cleletiotr of c~xtracelfufar tlomaitt anal (’ tctrtttitrus. EMHO ./. 7. 100% I01 I. Itajkovic,. :L Simonsen. .J. S.. Ijavis. Ii. fC. K: f
Amino

Acid

Substitution

Matrices

Smith.

T.

565

F. Bi Waterman,

common Rao.

.J. K. t&due

M.

(fQ87). exchanges

physical

ICew scoring based on

pramrtrrs.

lnt.

matrix residue

,J.

for amino vharacteristic~

Prpt.

J’rofpin

Smith. 29.

27ii-28 I Richardson. The root

IXlworth.

amino nodulrs

Lrftws. Rio&n.

wid of

51. R..

.I.

%I. .J. & Scawen. srqurnw f,road

fwan .J. M..

ft.. (irzelvzak. PC.. (‘hou. .J.

>I. C’.. Itlrnt,ific.ation

(‘ollins.

F. of the

c,harac,trrizatioti fO(i(i~

of’

245.

.J. I,..

of

X-3i. fZjommens,

Rozmahrl. I’lavsic*.

Rislrr.

11. 1).

legharmoplobin ( fYicio ~/IOU Krrem.

I,.,

S. c*ystica

Br Tsui. fibrosis

1 from

protein-binding sites from unaligned I’ror. Snt. Amd. Sci., IT.S..4. 86. Suzuki. T. (IQHQ). Amino acid sequrnw

f3. S.. .-\lon.

Z.. Zielenski. Drumm. 31.

,J.. I,ok. I,.. lannuzzi. I,.

comftlementary

S..

from

S..

Jliochim. Taylor. LIT.

(‘. (fQ8Q). cloning an<1

grnr:

I1S.A.

IM~nmr~.

>I.

0..

Drlacroix.

H.

acid substitutions A f’attrrn rwognition of a ttew and efic~ient

Sripnw.

Nr Henaut.

in

strtwturall~ aftftroac*h. scvring matrix.

I).

IHits Shp/“”

&

Kruskal.

curd

Tinrr

TITc~rpcs. Sfrirc!g

Thu Th,twry i\d(fisoli-~l’~,sl~~,

,Vtr~rottrolr~ulus: ( ‘~~mpurisor~.

rrttd

Procticr Reading.

Van of

Sellers.

sttftpl. 3. Il’ashington.

pp

?~.53G3.58. IN’.

I’. H. favolutiottar>

(f!)71).

On tlistanws.

Sat.

(fQ7X). .\fatriws fot In il tins of J’rotuirl hf. 0.. etl.). vol. 5.

Kiomed.

Rrs.

I’.

H,

swfuencw

( IQXa). fly

ftr’ott~in

.Yofuru

III

cucumber

A&.

998. identification

consensus

(19%). similarities.

(IQW).

The

fdrntifving

f)S;Z fragments. 1183-f 1x7. of a major gfobin

Pnrrrrnrrdinn

chi1rnsi.q.

292-%Q6. of protein

secfurnc~~

trmplatr

,J. No/.

alignment.

188. “:K-Px

Y.. Primary

Xagata. A.. struct,urts

Suzuki. Y. & of rat brain

deduced

from 12’.

Sccrncr. tre

172. IOHQ-1096.

\\‘att)rrttan.

caI)SA

K.

distributions LVeghr.

(f!MQ). D

.J. Kiol.

wcfttenc~r.

and J’hysiol.

Fitting

tliscwte

rvoltttiottary

(‘oppiet?rs. .J. the

events.

\\‘..

h Koucluet. swum fwotrins

x t B-glycoprotein

l&uw.

Y. in

(i..

(IQXX). Thr 1’01 in f)ig. X k

human.

( ‘owp.

I<. (f!)Xi).

f)hase

90B. 751-756.

.\I. S.. (:ordoti.

I rartsittons strlrc*tuw

(1971). to

A..

brtwrrn

tn horsr Kiochen.

Hayaisfti. 0. prostaglattdin

fO4f---1045. (‘orbin.

f)robaf)ility

theory SIAM

and computation .J. ,4ppl. Mnfh.

of 26.

f’af

mismatc~h

tern

recognitiott

drnsit,y.

Hull.

in Myth.

grnrtic, Rid.

46.

\\‘ilbrtr.

I,. & ;\twtia.

it1 srcfttrnw I’rcJc. .l’rrf.

\\’

Swd. I<. (1988). Thp 1~11s is a I)hosI)holil.‘iti-finkrcl (Lonrlor~).

Fry

333. 56X-570.

Ircrptor mrmbrane

ot

.J.

(19%).

]trotcGn rvolrttion. %alitc3i,t. )I.. (:ottzalez. R. ,I.. Sul+wtidr ratisfiirasr .Yu~,l. .lcirls

1

I). 8 killrr

of

tnatcnhrs .4cd ,Sci..

anal f’.S

tttt(.ItJi(. .-I 84.

ac.itl

f?:f!b

I”43 thra

501-~514. Sinttnolts. natural

by

‘l’andrrkrwkhovr. homology

Found..

‘iti-Xl. Srllcbrs.

sea

Rio&p. R. (IQM).

s\-nthrtasr C’hwu. 264. ITzzrll. T. B

RIA. Schwartz. f<. JI. K- f)ayhofr. Yf. 0. tlrtrc*ting distant relationships. ~~P(/IIPI~C~ rend StrucfurP (Dayhoff.

thr

Kid. I’ratfr.

A.

$

homology

.J. Mol. Hiol. 204. 1019

1029. .J. I{. (1983).

I).

S. &r fsttrks. 1’. of‘ nuc,fric. ac.itl

Res. 13, 645-656. Hartzelf. (:. C1’..

ilcids (:.

>I.

Stormo.

FEKS

I,.).

Lvaterman. distribution

(1975).

1073.

(IQXX). Amino wlatrtf prottsins. I)c+rtninat,iott Sankott‘,

T. F.. statistical ,\:ctcl.

M.,

Itientifi~ation

,J. Mol. Hio/. 147.

subsrqurnws.

19%197.

arid

KPS.

11. S. (1981).

molecular

On Nol. A..

Mwlpartida. srcfurnw gfw

Hr.<.

of

the

f’;\JI

tttatt,ix

tnoclrl

/lid. 15~01. 2. 434-447. (~rtrrrrro. ,\I. C’.. .\l;tttaliatto. F. 8~ .Iimrrtrz. thr h>pwntj.c*in

from

14. IX5

Sfrcptomp~.u

~IBXI,

.\. (f!M). I< phosf)hof/!/!/,,o.s~o/)ictrs.

of