J. theor. Biol. (2000) 206, 379–386. doi:10.1006/jtbi.2000.2138, available online at http://www.idealibrary.com
Information Content of Protein Sequences

OLAF WEISS*‡, MIGUEL A. JIMÉNEZ-MONTAÑO† AND HANSPETER HERZEL*

*Institute for Theoretical Biology, Humboldt University Berlin, Invalidenstr. 43, D-10115 Berlin, Germany and †Universidad de las Américas/Puebla, Sta. Catarina Mártir, 72820 Puebla, México

‡ Author to whom correspondence should be addressed.

(Received on 10 March 2000, Accepted in revised form on 23 June 2000)
The complexity of large sets of non-redundant protein sequences is measured. This is done by estimating the Shannon entropy as well as by applying compression algorithms to estimate the algorithmic complexity. The estimators are also applied to randomly generated surrogates of the protein data. Our results show that proteins are fairly close to random sequences. The entropy reduction due to correlations is only about 1%. However, precise estimation of the entropy of the source is not possible due to finite sample effects. Compression algorithms also indicate that the redundancy is of the order of 1%. These results confirm the idea that protein sequences can be regarded as slightly edited random strings. We discuss secondary structure and low-complexity regions as causes of the observed redundancy. The findings are related to numerical and biochemical experiments with random polypeptides. © 2000 Academic Press
Introduction

Living systems are characterized not just by metabolism and reproduction, but also by a flow of information. On the molecular level, information is carried by the sequences of DNA and proteins. There have been quite a few studies on the amount of information per base in DNA sequences. Compression algorithms applied to DNA sequences are discussed by Grumbach & Tahi (1994). Rivals et al. (1997) use a compression algorithm to detect tandem repeats. Entropies of DNA sequences have been discussed by various authors (Mantegna et al., 1994; Schmitt & Herzel, 1997; Herzel et al., 1994, and references therein). The mutual information, a measure intimately related to entropies, has been used successfully to predict protein-coding regions in DNA (Gross et al., 2000).
For protein sequences it is much harder to give a precise estimate of the information per residue, as the alphabet is larger and the sequences are shorter than in DNA. However, there have been a few attempts to quantify the information per amino acid in protein sequences. Yockey (1977) estimated the information per amino acid in the cytochrome c protein family to be 2.953 bits. Strait & Dewey (1996) took a rather small but representative set of 190 proteins and estimated the information in the sequences to be between 2.4 and 2.6 bits. Rani & Mitra (1996) estimated the per-residue entropy in the SwissProt database to be 1.2 cal/deg, which translates to approximately 3.5 bits per residue. We will discuss these approaches and their relation to our work in the Discussion section. Furthermore, other techniques have been applied to quantify the ''non-randomness'' of protein sequences. White & Jacobs (1993) have analysed run statistics in proteins.
Pande et al. (1994) have mapped protein sequences onto random walks to detect differences from the trajectories of a Brownian particle. Garnier et al. (1978), Kanehisa & Tsong (1980), Macchiato et al. (1985) and Weiss & Herzel (1998) have studied correlation functions in protein sequences. The related approach of Fourier analysis is also frequently applied (Liquori et al., 1986; Berman et al., 1994; Makeev & Tumanyan, 1996; Rackowsky, 1998; Chechetkin & Lobzin, 1999, and others). However, the results of these studies cannot be used directly to quantify the overall information content of protein sequences. Therefore, measures of complexity, such as the Shannon entropy (Shannon, 1948), have to be estimated. In this study, we apply block entropies (Gatlin, 1972; Herzel, 1988) to sets of protein sequences. First-order finite sample corrections are applied. To improve the statistics of the entropy estimate, we reduce the 20-letter amino-acid alphabet to smaller alphabets of 2, 3, and 4 letters. In addition, we calculate the grammar complexity (Ebeling & Jiménez-Montaño, 1980) of selected sequences to complement our studies. Our results suggest that the complexity of proteins is about 1% smaller than that of random polypeptides of the same amino-acid composition. Furthermore, we find that a reliable estimate of the entropy of the source of protein sequences is not possible with the limited amount of data available.

Methods

DIFFERENTIAL ENTROPIES
Definitions

Since the pioneering work of Shannon (1948), entropy has been regarded as a measure of the information in a probability distribution $P_i$. It is defined as
$$H = -\sum_i P_i \log_2 P_i. \qquad (1)$$
In symbolic sequences such as proteins, the probabilities $p_i^{(n)}$ of $n$-tuples ($n$-words) are taken as the underlying distribution. The resulting block entropies are a function of the word size $n$:
$$H_n = -\sum_i p_i^{(n)} \log_2 p_i^{(n)}. \qquad (2)$$
Differential entropies $h_n$ are defined as the forward differences of the block entropies $H_n$:
$$h_n = H_{n+1} - H_n. \qquad (3)$$
In terms of the concept of conditional entropies (Kullback, 1959), $h_n$ can be seen as the entropy of the conditional probability $P(x_{n+1} \mid S_n)$, where $x_{n+1}$ is the letter following the $n$-tuple $S_n$.

The differential entropy $h_n$ has an intuitive information-theoretical meaning: given the $n$-tuple $S_n$, $h_n$ is the average number of yes-or-no questions needed to find out the following symbol $x_{n+1}$ using an optimal strategy. In other words, $h_n$ is the degree of uncertainty about the next symbol if the preceding $n$-tuple is known. The entropy of the source,
$$h = \lim_{n \to \infty} h_n, \qquad (4)$$
gives the information per symbol in a given sequence. It is also a lower bound for the compressibility of the sequence: a string of $N$ symbols cannot be compressed to fewer than $hN$ bits.
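As a minimal illustration of eqns (2) and (3), the following Python sketch estimates block and differential entropies from overlapping $n$-word counts (this is the ''natural'' estimator discussed below); the function names and the plain-string representation of the sequence are our own choices, not part of the original analysis.

```python
import math
from collections import Counter

def block_entropy(seq, n):
    """Natural estimator of H_n, eqn (2): entropy of overlapping n-words, in bits."""
    words = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    total = len(words)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(words).values())

def differential_entropy(seq, n):
    """h_n = H_{n+1} - H_n, eqn (3): uncertainty about the next symbol
    given the preceding n-tuple."""
    return block_entropy(seq, n + 1) - block_entropy(seq, n)
```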
Estimating entropies from sequences of finite length

Estimating block entropies from a sequence of finite length leads to an underestimation of the entropy due to finite sample effects (Bašarin, 1959; Schmitt & Herzel, 1997). This effect is due to the fact that the number of $n$-words, growing exponentially with $n$, cannot be sampled accurately in a finite sequence for large $n$. The systematic underestimation of $H_n$ can be seen in Fig. 1, where the entropies of randomly generated surrogate sequences are shown. The surrogates are created as follows: for each protein sequence, the amino-acid frequencies are counted, and a concatenation of independent letters is generated using these relative frequencies as probabilities. From theory we expect a linear growth of the $H_n$, as displayed by the full line. In practice, a random sequence's entropy is underestimated, as shown by the dashed line. The estimator $\hat{H}_n$ that simply uses relative frequencies as estimators of the probabilities is called the natural estimator.
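The surrogate construction just described can be sketched in a few lines of Python (the function name and the seeded generator are our own choices; one surrogate is generated per sequence and the results are concatenated, as in the text):

```python
import random
from collections import Counter

def make_surrogate(seq, rng=None):
    """Concatenation of independent letters drawn with the relative
    amino-acid frequencies of the original sequence."""
    rng = rng or random.Random(0)   # seeded for reproducibility
    letters, weights = zip(*Counter(seq).items())
    return "".join(rng.choices(letters, weights=weights, k=len(seq)))
```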
FIG. 1. Block entropies $H_n$ in the surrogate data of the superfamily set (1.1 million amino acids) using the 20-letter alphabet. Due to finite sample effects the entropy is underestimated for large $n$ (dashed line). Even an estimate including the correction for the bias given in eqn (5) (dotted line) differs from the entropy known from theory (solid line). The inset shows the effect of the finite sample and of the bias correction on the estimation of the differential entropies $h_n$.
Harris (1975) has calculated all moments of the distribution of $\hat{H}_n$. The first moment gives the bias of the estimate:
$$E(\hat{H}_n - H_n) = -\frac{M - 1}{2N \ln 2} + O\!\left(\frac{1}{N^2}\right), \qquad (5)$$
where $E(\cdot)$ denotes the expectation value of the estimate of $H_n$, $N$ is the number of sampled $n$-words, and $M$ is the number of $n$-tuples with non-vanishing probability $p_i^{(n)}$. We subtract this bias from the entropy estimates in all subsequent figures. In Fig. 1, the entropy estimates without bias correction (dashed line) and with bias correction (dotted line) are displayed. The bias term of eqn (5) can give only a minor correction, since the number of $n$-tuples increases as $20^n$ if the full amino-acid alphabet is used. Therefore, we reduce the amino-acid alphabet according to different levels of a tree of amino-acid classifications, as proposed in Jiménez-Montaño (1984). In this way, longer words, and hence longer correlations, can be analysed. We estimate the entropies in large non-redundant sets of protein sequences, similar to the procedure used in Weiss & Herzel (1998), taking the $n$-tuple frequencies from all sequences to calculate the entropy.
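A sketch of the bias-corrected estimator follows. Since eqn (5) says the natural estimator underestimates $H_n$ by $(M-1)/(2N \ln 2)$ to first order, that term is added back. The number of distinct $n$-words actually observed serves as a stand-in for $M$, the number of words with non-vanishing probability; this is a common, though imperfect, choice.

```python
import math
from collections import Counter

def corrected_block_entropy(seq, n):
    """Natural estimator of H_n plus the first-order correction of eqn (5):
    E(H_hat - H) = -(M - 1)/(2 N ln 2), so (M - 1)/(2 N ln 2) is added back.
    M is approximated by the number of distinct n-words observed."""
    words = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    counts = Counter(words)
    N = len(words)
    h_hat = -sum((c / N) * math.log2(c / N) for c in counts.values())
    M = len(counts)
    return h_hat + (M - 1) / (2 * N * math.log(2))
```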
We also estimate the entropies in corresponding sets of randomized sequences. As data sets, we use a set of protein sequences containing one protein of each superfamily; this superfamily set was introduced by White (1994).* Furthermore, we use the yeast and E. coli proteomes as data sets.

* The superfamily set can be accessed via FTP at blanco.biomol.uci.edu (IP number 128.200.58.4).

Figure 2 shows the behaviour of $h_n$ in the corresponding data sets. The minor differences between the curves can be attributed to the size and overall composition of the data (see the caption). Comparison with the inset in Fig. 1 reveals that the values of $h_n$ are dominated by finite sample effects for $n \geq 3$. For small $n$ there is only a tiny decay of the differential entropies, indicating weak correlations of pairs and triples of amino acids.

FIG. 2. Differential entropies $h_n$ in several data sets using the 20-letter alphabet. The differences in $h_n$ reflect the individual amino-acid compositions of the data sets. The yeast proteome data set has a smaller finite sample effect, as it comprises 2.9 million amino acids (Maa), as opposed to 1.3 Maa in E. coli and 1.1 Maa in the superfamily set.

The second moment of $\hat{H}_n$ (Harris, 1975) can be used to obtain the expected variance of the estimate. However, our simulations show that this is only valid for non-overlapping sampling of the word frequencies; the overlapping sampling we perform yields a significantly different variance. We have observed the standard deviation (S.D.) of the $h_n$ to be around $5 \times 10^{-4}$ bits.

GRAMMAR COMPLEXITY
Grammar complexity, as introduced by Ebeling & Jiménez-Montaño (1980), constitutes an attempt to determine the algorithmic complexity of a sequence introduced by Kolmogorov (1968). The essence of this concept is to compress a sequence by introducing new variables (syntactic categories); the length of the compressed sequence is then taken as a measure of the complexity of the sequence. The set of all finite strings formed from the members of the alphabet $X$ is denoted by $X^*$. Any subset of $X^*$ is a language. A context-free grammar is a quadruple $G = \{N, T, P, S\}$ where:

1. $N$ is a finite set of elements called non-terminals (syntactic categories), including the start symbol $S$.
2. $T$ is a finite set of elements called terminal symbols (the letters of the alphabet).
3. $P$ is a finite set of ordered pairs $A \to q$, called production rules, such that $q \in (N \cup T)^*$ and $A \in N$.

Let us consider a grammar $G$ such that the language generated by $G$ consists of the single sequence $w$. The grammar complexity of $w$ (Ebeling & Jiménez-Montaño, 1980) is then defined as follows. The complexity $K(A \to q)$ of a production rule $A \to q$ is the length of the word $q$ on its right-hand side (r.h.s.). The complexity of a sequence $w$ with respect to a given context-free grammar with non-terminal set $N$ is defined by the sum
$$K_N(w) = \sum_{A \in N} K(A \to q). \qquad (6)$$
The grammar complexity is the minimum of this quantity over all grammars generating $w$:
$$K(w) = \min_N K_N(w). \qquad (7)$$
To estimate the grammar complexity, we use the algorithm presented in previous publications (Ebeling & Jiménez-Montaño, 1980; Jiménez-Montaño et al., 1997). It consists of a gradient optimization method to minimize $K_N(w)$; it might therefore not reach the global minimum, but it yields an upper bound on the grammar complexity.
COMPRESSIBILITIES

In addition to the complexities described above, we performed simple experiments using standard data-compression tools. We compressed the sequences as well as the surrogates using the Unix compress tool, based on the modified Lempel–Ziv algorithm described in Welch (1984); the gzip tool, using Lempel–Ziv coding (Ziv & Lempel, 1977); and the bzip2 tool, based on the Burrows–Wheeler algorithm (Burrows & Wheeler, 1994). gzip and bzip2 were applied with the options for fastest (-1) and strongest (-9) compression. The compression was applied to a concatenation of all sequences in the superfamily set, separated by a special character; the same was done for the surrogate data. We report the size of the compressed data file divided by the size of the compressed surrogate data file.
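This experiment can be approximated in a few lines of Python; here the standard-library gzip and bz2 modules stand in for the command-line tools (the compress/LZW tool has no standard-library counterpart and is omitted in this sketch):

```python
import bz2
import gzip

def compression_ratios(proteins, surrogates, sep="#"):
    """Compressed size of the protein data divided by the compressed size
    of the surrogate data; values below 1 indicate redundancy."""
    real = sep.join(proteins).encode("ascii")
    rand = sep.join(surrogates).encode("ascii")
    tools = {
        "gzip-1":  lambda b: gzip.compress(b, compresslevel=1),
        "gzip-9":  lambda b: gzip.compress(b, compresslevel=9),
        "bzip2-1": lambda b: bz2.compress(b, compresslevel=1),
        "bzip2-9": lambda b: bz2.compress(b, compresslevel=9),
    }
    return {name: len(fn(real)) / len(fn(rand)) for name, fn in tools.items()}
```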
Results

Inspired by the hierarchy of amino-acid categorizations proposed by Jiménez-Montaño (1984), we reduced the amino-acid alphabet to the two-, three-, and four-letter alphabets shown in Table 1. The differential entropies $h_n$ corresponding to these reduced alphabets in the superfamily set are shown in Fig. 3.

The entropy in the two-letter alphabet sequences can be estimated up to $n = 14$; for larger $n$, finite sample effects dominate the entropy estimate. The two-letter alphabet is essentially a hydrophilic/hydrophobic classification of the amino acids. Therefore, the decay of the $h_n$ in the two-letter alphabet in Fig. 3 implies that the hydrophilic/hydrophobic two-residue correlations described in Weiss & Herzel (1998) are also reflected in higher-order correlations. The three-letter alphabet, roughly classifying the amino acids into polar, charged, and hydrophobic residues, enables an estimate of the entropy up to $n \approx 9$; the decay is nearly linear except for the kink at $n = 1$. A similar behaviour of the $h_n$ is observed in the four-letter alphabet sequences, where we can estimate the $h_n$ up to $n = 6$.

In none of the reduced-alphabet sequences do the differential entropies reach a plateau. If the $h_n$ approached an asymptotic value within the range of tuple sizes $n$ where the estimate is reliable, this value would be a good estimate of the entropy of the source $h$. We do observe that the protein sequences display only about 1% of redundancy as long as the estimations are reliable.
TABLE 1
Classifications of the amino acids used to reduce the 20-letter alphabet to two, three, or four letters

Alphabet        Grouping of the amino acids
Two-letter      (P,A,G,S,T,Q,N,E,D,H,K,R) (C,L,I,V,F,M,Y,W)
Three-letter    (P,A,G,S,T) (Q,N,E,D,H,K,R) (C,L,I,V,F,M,Y,W)
Four-letter     (P,A,G,S,T) (Q,N,E,D) (H,K,R) (C,L,I,V,F,M,Y,W)

FIG. 3. Differential entropies $h_n$ in the superfamily set. The amino-acid alphabet has been reduced according to the classifications in Table 1. The top curve shows $h_n$ using a translation to the four-letter alphabet, the middle curve the three-letter alphabet, and the bottom curve the two-letter alphabet. The dotted lines depict the corresponding $h_n$ in a set of surrogate sequences of the same composition.
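For concreteness, a sketch of the alphabet reduction of Table 1 (the digit encoding and the group order are our own conventions):

```python
REDUCTIONS = {
    2: ["PAGSTQNEDHKR", "CLIVFMYW"],
    3: ["PAGST", "QNEDHKR", "CLIVFMYW"],
    4: ["PAGST", "QNED", "HKR", "CLIVFMYW"],
}

def reduce_alphabet(seq, k):
    """Translate a 20-letter protein sequence into the k-letter alphabet
    of Table 1; group i is encoded as the digit str(i)."""
    table = {aa: str(i) for i, group in enumerate(REDUCTIONS[k]) for aa in group}
    return "".join(table[aa] for aa in seq)
```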
GRAMMAR COMPLEXITY

Due to computational limitations, the grammar complexity could only be estimated in the first 32 000 positions of the reduced two-, three-, and four-letter alphabet versions of the data. In all three cases, the grammar complexity of the proteins was about 0.5% smaller than that of the surrogates (see Table 2). However, our simulations show that the grammar complexity of the surrogates fluctuates rather strongly: the S.D. is about 0.25%, and some 5–15% of the surrogates actually have a smaller grammar complexity than the protein data.

TABLE 2
Grammar complexity of the protein data divided by the grammar complexity of the surrogates in the two-, three-, and four-letter alphabets. We have generated an ensemble of surrogates to obtain an ensemble of ratios. The values in the table are the means and S.D. of the ensembles

Two-letter        Three-letter      Four-letter
0.994 (±0.004)    0.996 (±0.003)    0.996 (±0.004)
COMPRESSIBILITY

We have compressed the superfamily set as well as a surrogate in the two-, three-, four-, and 20-letter alphabet versions. Figure 4 shows the compressional complexity for all alphabets and several tools. From an ensemble of 100 surrogates we found the S.D. to be below 0.1%. As can be seen in Fig. 4, the compressibility of the protein data exceeds that of the surrogate data by only 0.3–1.4%, depending on the alphabet and compression method used; the average lies around 1%. No significant dependence on the alphabet used is observed.
FIG. 4. Results obtained by measuring the complexity of protein sequences using various compression algorithms. The quantity displayed is the size of the compressed protein data file divided by the size of the compressed surrogate data file, in the two-, three-, four-, and 20-letter alphabets (see Methods), for compress, gzip-1, gzip-9, bzip2-1, and bzip2-9.
LOW-COMPLEXITY REGIONS
The redundancy found could be caused by the constraints imposed by secondary structure or by low-complexity regions. To assess the influence of low-complexity regions on the redundancy of the sequences, we have filtered the sequence data with the program SEG† by Wootton & Federhen (1993).

† Available by FTP from ncbi.nlm.nih.gov in the directory /pub/seg/
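If SEG is run so that low-complexity residues are masked with the letter 'x' (we believe the -x option does this, but the flag is our assumption; consult the SEG documentation), the unmasked segments can be extracted as follows:

```python
def remove_masked(masked_seq):
    """Drop residues marked as low-complexity, assuming SEG replaced
    them by 'x' in its masked output (flag name is our assumption)."""
    return "".join(aa for aa in masked_seq if aa not in "xX")
```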
With default parameters, SEG marked 6.4% of the residues in the superfamily set as low-complexity regions. Interestingly, 4.1% of the residues in the surrogate of the superfamily set were also classified as low-complexity. Applying the techniques described above to the sequence segments not marked as low-complexity, we find the redundancy to be slightly reduced compared with the original data (by about one-third).

Summary and Discussion

Already Monod (1969) expressed the view that protein sequences are not structured by any rules. The results obtained for all measures of complexity applied in this study support the notion of protein sequences as ''slightly edited random sequences'' expressed by Pande et al. (1994), White & Jacobs (1993), and others. To express this in numbers: proteins have approximately 99% of the complexity of random polypeptides with the same amino-acid composition. Random polypeptide sequences with equidistributed amino acids would have an entropy of $h = \log_2(20) \approx 4.32$ bits. If the average composition of the data used is taken into account, the entropy is $h_0 = 4.19$ bits in the superfamily set. Pair and triplet correlations reduce the entropy further to $h_1 = 4.18$ and $h_2 = 4.17$ bits, respectively.
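For transparency, the arithmetic behind these numbers (values taken from the text): the two redundancy figures are relative to equidistribution and to the composition entropy, respectively; longer-range correlations and low-complexity regions bring the latter to the roughly 1% quoted above.

```python
import math

h_max = math.log2(20)   # ~4.32 bits: equidistributed amino acids
h0 = 4.19               # composition entropy of the superfamily set
h2 = 4.17               # after pair and triplet correlations
print(f"reduction vs. equidistribution: {1 - h0 / h_max:.1%}")  # ~3.1%
print(f"reduction due to correlations:  {1 - h2 / h0:.1%}")     # ~0.5%
```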
For longer words, finite sample effects dominate the entropy estimations.

Previous entropy estimates in protein sequences resulted in much lower values (see Introduction). On the one hand, this can be a result of the quite different protein sets used: if, for example, only the cytochrome c family is studied (Yockey, 1977), larger redundancies due to conserved residues lead to smaller entropies. On the other hand, finite sample effects can mimic an entropy decay, as illustrated in Fig. 1.

For reduced alphabets we can probe longer words. Still, the decay of the entropies $h_n$ remains very small. Our work shows that it is currently not possible to determine the asymptotic value $h$ from the protein data available. The amount of non-homologous protein sequence data needed for each new $h_n$ grows like $\lambda^n$, where $\lambda$ is the size of the alphabet used. Therefore, even with the rapid data production nowadays, the entropy of the source of proteins will not be available in the near future.

Even though proteins are the result of long and specific evolution, few regularities in the sequences can be detected by the means of information theory. Our rather rough estimate that proteins are about 1% less complex than random sequences stresses this point. Studies on protein structure prediction have shown that non-randomness in the sequence is not required for a protein to be functional: Ptitsyn & Volkenstein (1986) suggest that proteins are slightly edited random polymers, where only the residues in key positions of the active centre are fixed. This notion is supported by Saito et al. (1997), who performed Monte Carlo simulations in which only three positions of a model polymer are constrained. Gerstein et al. (1994) show that the small variations in the core size of proteins may be a manifestation of the statistical ''law of large numbers'' rather than being caused by global constraints on the sequences. Golumbfskie et al. (1999) show that little deviation from randomness in the sequence is needed for a protein to recognize a specific receptor surface. Experimental work (Yamauchi et al., 1998) shows that soluble random polypeptides can form a compact molten globule.

Our analysis using the SEG program indicates that about one-third of the redundancy observed
is caused by low-complexity regions, whereas the remainder is presumably caused by the rather weak correlations due to secondary structure described in Weiss & Herzel (1998). These two effects apparently account for the observed redundancy of about 1%.

We acknowledge the programming support of Thomas Pohl. Furthermore, we are grateful to the Deutsche Forschungsgemeinschaft for funding. MAJM thanks the Innovationskolleg Theoretische Biologie Berlin for its hospitality during a sabbatical year, and CONACYT (proyecto 32201-E), Mexico, for partial support.
REFERENCES

BAŠARIN, G. P. (1959). On a statistical estimate for the entropy of a sequence of independent random variables. Teor. Veroyatnost. Primenen. 4, 361–364.
JIMÉNEZ-MONTAÑO, M. A. (1984). On the syntactic structure of protein sequences and the concept of grammar complexity. Bull. Math. Biol. 46, 641–659.
JIMÉNEZ-MONTAÑO, M. A., PÖSCHEL, T. & RAPP, P. E. (1997). A measure of the information content of neural spike trains. In: Biological Complexity, A Symposium (Acerenza, L., Alvarez, F. & Pomi, A., eds), pp. 113–142. Montevideo, Uruguay: D.I.R.A.C., Facultad de Ciencias.
KANEHISA, M. I. & TSONG, T. Y. (1980). Hydrophobicity and protein structure. Biopolymers 19, 1617–1628.
KOLMOGOROV, A. N. (1968). Three approaches to the definition of the concept ''quantity of information''. IEEE Trans. Inf. Theory IT-14, 662–669.
KULLBACK, S. (1959). Information Theory and Statistics. New York: Wiley.
LIQUORI, A. M., RIPAMONTI, A., SADUN, C., OTTANI, S. & BRAGA, D. (1986). Pattern recognition of sequence similarities in globular proteins by Fourier analysis: a novel approach to molecular evolution. J. Mol. Evol. 23, 80–87.
MACCHIATO, M. F., CUOMO, V. & TRAMONTANO, A. (1985). Determination of the autocorrelation orders of proteins. Eur. J. Biochem. 149, 375–379.
MAKEEV, V. J. & TUMANYAN, V. G. (1996). Search of periodicities in primary structure of biopolymers: a general Fourier approach. CABIOS 12, 49–54.
MANTEGNA, R. N., BULDYREV, S. V., GOLDBERGER, A. L., HAVLIN, S., PENG, C.-K., SIMONS, M. & STANLEY, H. E. (1994). Linguistic features of non-coding DNA sequences. Phys. Rev. Lett. 73, 3169–3172.
MONOD, J. (1969). On symmetry and function of biological systems. In: Proc. 11th Nobel Symp. (Engström, A. & Strandberg, B., eds), pp. 15–17. New York: Wiley Interscience.
PANDE, S. V., GROSBERG, A. Y. & TANAKA, T. (1994). Nonrandomness in protein sequences: evidence for a physically driven stage of evolution? Proc. Natl. Acad. Sci. U.S.A. 91, 12972–12975.
PTITSYN, O. B. & VOLKENSTEIN, M. V. (1986). Protein structures and neutral theory of evolution. J. Biomol. Struct. Dyn. 4, 137–156.
RACKOWSKY, S. (1998). ''Hidden'' sequence periodicities and protein architecture. Proc. Natl. Acad. Sci. U.S.A. 95, 8580–8584.
RANI, M. & MITRA, C. K. (1996). Pair preferences: a quantitative measure of regularities in protein sequences. J. Biomol. Struct. Dyn. 13, 935–944.
RIVALS, E., DELGRANGE, O., DELAHAYE, J.-P., DAUCHET, M., DELORME, M.-O., HÉNAULT, A. & OLLIVIER, E. (1997). Detection of significant patterns by compression algorithms: the case of approximate tandem repeats in DNA sequences. CABIOS 13, 131–136.
SAITO, S., SASAI, M. & YOMO, T. (1997). Evolution of the folding ability of proteins through functional selection. Proc. Natl. Acad. Sci. U.S.A. 94, 11324–11328.
SCHMITT, A. O. & HERZEL, H. (1997). Estimating the entropy of DNA sequences. J. theor. Biol. 188, 369–377.
SHANNON, C. E. (1948). A mathematical theory of communication. The Bell System Tech. J. 27, 379–423, 623–656.
STRAIT, B. J. & DEWEY, G. (1996). The Shannon information entropy of protein sequences. Biophys. J. 71, 148–155.
WEISS, O. & HERZEL, H. (1998). Correlations in protein sequences and property codes. J. theor. Biol. 190, 341–353.
WELCH, T. A. (1984). A technique for high performance data compression. IEEE Comput. 17, 8–19.
WHITE, S. H. (1994). Global statistics of protein sequences: implications for the origin, evolution, and prediction of structure. Annu. Rev. Biophys. Biomolec. Struct. 23, 407–439.
WHITE, S. H. & JACOBS, R. E. (1993). The evolution of proteins from random amino acid sequences. I. Evidence from the lengthwise distribution of amino acids in modern proteins. J. Mol. Evol. 36, 79–95.
WOOTTON, J. C. & FEDERHEN, S. (1993). Statistics of local complexity in amino acid sequences and sequence databases. Comput. Chem. 17, 149–163.
YAMAUCHI, A., YOMO, T., TANAKA, F., PRIJAMBADA, I. D., OHHASHI, S., YAMAMOTO, K., SHIMA, Y., OGASAHARA, K., YUTANI, K., KATAOKA, M. & URABE, I. (1998). Characterization of soluble artificial proteins with random sequences. FEBS Lett. 421, 147–151.
YOCKEY, H. P. (1977). On the information content of cytochrome c. J. theor. Biol. 67, 345–376.
ZIV, J. & LEMPEL, A. (1977). A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory IT-23, 337–343.