Integral and differential form of the protein folding problem


Physics of Life Reviews 1 (2004) 103–127 www.elsevier.com/locate/plrev

Anna Tramontano

Department of Biochemical Sciences “A. Rossi Fanelli”, University of Rome “La Sapienza”, P. le Aldo Moro 5, 00185 Rome, Italy

Accepted 28 April 2004; available online 11 June 2004

Communicated by E. Di Mauro

Abstract

The availability of the complete genomic sequences of many species, including human, has raised enormous expectations in medicine, pharmacology, ecology, biotechnology and forensic sciences. However, knowledge is only a first step toward understanding, and we are only at the early stage of a scientific process that might lead us to satisfy all the expectations raised by the genomic projects. In this review I will discuss the present status of computational methods that attempt to infer the unique three-dimensional structure of proteins from their amino acid sequences. Although this problem has been defined as the “holy grail” of biology, it represents only one of the many hurdles in our path towards the understanding of life at a molecular level.

© 2004 Elsevier B.V. All rights reserved.

Contents

1. Introduction
2. The folding problem
3. Looking at solved examples
4. Protein evolution
5. The differential form of the folding problem
6. Detecting evolutionary relationships
7. Intermediate and profile based search methods
8. Building a comparative model
9. The integral form of the folding problem
10. Future directions
Acknowledgements
References

E-mail address: [email protected] (A. Tramontano).

1571-0645/$ – see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.plrev.2004.05.002

1. Introduction

The major components of living organisms are proteins, linear polymers of amino acids whose specific sequence of monomers is dictated by the genes of the organism. The genes, linear polymers of nucleotides, are directly translated into the linear sequence of amino acids of the encoded protein through a practically universally conserved code. This conceptually simple process is biochemically rather complex: it involves a number of cellular machineries and is subject to an intricate network of control mechanisms.

The decoding of the sequence of a genome, especially that of a eukaryote such as human, is but a first step in the path towards the understanding of the chemico-physical mechanisms of what we call life. Given a genome, we need to find the regions that encode a function, i.e., the genes; in human, for example, no more than 3% of the 3 billion nucleotides of our genome encode genes [1,2]. Genes are not necessarily contiguous and are often derived by joining segments interrupted by non-coding regions (introns) not clearly distinguishable from the coding ones [3–5]. In some cases different combinations of the coding segments of the same gene can be translated into different products in different cell types or in different conditions; the detection of these cases and their interpretation is yet another of the challenges of genomic research [6–12]. Generally speaking, gene finding is not yet a completely solved problem and we are still struggling to detect all the genes in the available genomes and to correctly assemble their segments [13–18].

Once the boundaries of a gene have been identified, its sequence can be directly translated into the amino acid sequence of the encoded protein. This is certainly not sufficient to tell us when, where and at which level the protein will be synthesized during the life of the organism (different computational and experimental techniques are needed to find the genomic regions that control the translation of genes [19–26]). However, the atomic composition of the nascent protein can be immediately deduced from its amino acid sequence. Is this sufficient to infer the cellular and molecular function that the protein performs? Proteins mediate the majority of the functions of an organism, ranging from recognizing foreign proteins to activating defence mechanisms, from triggering the translation of genes to performing enzymatic catalysis, from organizing the body of the organism to sensing external conditions, and all these functions are, by and large, determined by the proteins’ three-dimensional structure.


Natural proteins spontaneously assume a unique three-dimensional structure [27]. This is not a general property of polymers, while it is shared, with rare exceptions, by all natural functional proteins. It is achieved by the interplay between molecular evolution and environment-driven selective pressure. Selective pressure acts on function, function requires a unique structure, and therefore protein sequences are selected for being able to fold into a unique structure [28]. This peculiarity of native protein sequences is the reason for their exceptional plasticity and versatility and leads to the exquisite specificity of these macromolecules as well as to a very precise control of their activity through complex networks of interactions. If we could understand the relationship between protein sequence, structure and function, we could make effective use of the large body of genetic information available for many organisms, human included, list all their functions and study their interplay in shaping life. This intellectually stimulating problem, which is among the most challenging of modern science, is still unsolved after many years of efforts and frustrations [29]. Here I will focus on the sequence–structure relationship; however, the problem of deducing protein function from structure is equally fascinating and, unfortunately, almost as elusive [30–32].

2. The folding problem

A protein chain, even when chemically synthesized, achieves the same native structure independently of the starting conditions and does so spontaneously, strongly suggesting that its native state is thermodynamically the most stable under physiological conditions [33–40]. The folding problem, that is the problem of inferring the structure of a protein from its amino acid sequence, can therefore be solved by searching for the most stable structure among the set of conformations available to a given protein chain. There are two main obstacles in this procedure. The first is that the number of possible conformational states of a protein is enormous (at least $2^{100}$ for a chain of 100 amino acids), so that an exhaustive search is computationally intractable. In fact, this observation is also relevant for the development of a folding theory, as it is obvious that a protein cannot explore such a large number of states in a reasonable time frame, as required by the hypothesis that the native structure is thermodynamically the most stable. This paradox, known as the Levinthal paradox [41], has been relatively recently solved by the “funnel” theory of protein folding according to which, during the folding process, the loss of entropy of the protein chain is immediately compensated by an energy gain [39,42–47]. The theory allows protein folding times to be estimated in good agreement with experimental observations [48]. The reader is strongly encouraged to read the recent excellent review by Finkelstein and Galzitskaya [49] in this same series, which provides a very clear treatment of the problem and of its solution.

The funnel theory might, in the future, be of substantial help in the computational simulation of the folding process, but this will not necessarily allow the inference of the structure of a protein from its sequence alone because of another major difficulty, namely our limited ability to correctly evaluate the energy of a given protein chain configuration. The quantum-mechanical treatment of a protein structure is out of our reach, and it will remain so for many years, and generations, to come; therefore we need to approximate the protein as a classical object. This introduces an error in the evaluation of its free energy that is, unfortunately, too large to allow us to distinguish between the native structure and other, energetically similar, states. The problem lies in the fact that proteins are only marginally stable: their free energy of unfolding is of the order of a few kcal/mol, and the parameterisations used in our classical-mechanical treatment are too crude and approximate to detect the native structure among the many alternative configurations that can be assumed by the amino acid chain [50].

This implies that we cannot, at present, solve the folding problem ab initio, that is only on the basis of the physico-chemical properties of the amino acid chain: we have to resort to different methodologies and try to empirically derive a code relating protein sequence and structure. Admittedly, even if we found such a code and we could correctly infer (or predict, as we usually say) the structure of all the proteins in the universe, this would not be completely satisfactory from an intellectual point of view. However, the impact on the life sciences would be enormous: we could know the structure of any protein of any organism that we care to sequence, we could obtain detailed information on proteins of biochemical, biomedical and pharmaceutical interest and be more effective in designing therapeutically effective drugs and vaccines. I will therefore discuss here some available methods that allow the prediction of protein structures not on the basis of a theoretical understanding of the folding process, but based on the wealth of information about protein structures accumulated in the past years. We can observe instances of the solutions of the problem, that is proteins for which both the sequence and the structure are known, and try to derive rules that can be more or less generally applied to proteins for which only the amino acid sequence is known.
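To give a sense of scale for the conformational search discussed in this section, the following back-of-the-envelope sketch counts the $2^{100}$ states quoted above for a chain of 100 amino acids and converts them into a search time. The sampling rate of 10^13 conformations per second is an illustrative assumption and not a figure taken from this review; the function and parameter names are likewise hypothetical.

```python
# Back-of-the-envelope estimate of an exhaustive conformational search.
# ASSUMPTION: the sampling rate below is illustrative, not taken from the review.

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def exhaustive_search_years(n_residues, states_per_residue=2, rate_per_second=1e13):
    """Years needed to visit every conformation once at the given sampling rate."""
    n_conformations = states_per_residue ** n_residues
    return n_conformations / rate_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    years = exhaustive_search_years(100)
    print(f"2^100 = {2 ** 100:.3e} conformations")
    print(f"exhaustive search: about {years:.1e} years")
```

Even under this assumed rate the exhaustive search would take billions of years, which is the essence of the paradox: a real protein cannot be finding its native state by trying every conformation.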

3. Looking at solved examples

There are several thousand solved protein structures in our collection [51]. They have different levels of redundancy: in some cases different structures are available for the same protein analysed in different experimental conditions, e.g., bound to different ligands; in some other cases they represent structures of proteins the sequence of which has been modified by replacing one or a few amino acids, or by artificially linking different protein chains. Often they are proteins with the same function, but from different organisms, and therefore evolutionarily related. Their structural relationship will be discussed in the next section.

At first sight, a protein structure is an intricate assembly with no apparent regularities (Fig. 1a), with the atoms of the amino acid chain closely packed against each other, and it seems to defeat any attempt at classification, but the picture becomes much clearer if we consider its chain of amino acids (Fig. 1b). Amino acids contain a carbon atom called Cα, linked to a carboxylic group, an amino group, a hydrogen atom, and a variable chemical group (the side chain). There are twenty amino acids in natural proteins, differing in their side chains. In a protein they are linked together (in the order dictated by the gene sequence) by a chemical bond between the carboxylic group of one amino acid and the amino group of the adjacent one (Fig. 2). The chain connecting the amino group, the Cα and the carboxylic group of the amino acids of a protein is called the main chain or backbone. The observation of the topology of the backbone reveals that almost all proteins contain regions with regular repetitive structures called α-helices and β-sheets, as shown in Fig. 1 [52–54]. There exist proteins formed by only α-helices, by only β-sheets or by both (and obviously by the regions joining them) [55]. Some proteins are formed by more than one globular structure, called domains, that seem to be independent of each other. The secondary structure elements and their topological arrangement in a domain form what is called a “fold”, and our database contains several protein domains with different structures but recognizably similar folds [56–63]. Some proteins are formed by more than one polypeptide chain.


Fig. 1. The hierarchical structure of proteins. From left to right: an α-helix, a two-stranded anti-parallel β-sheet, an α–β–α motif formed by one α-helix packed against a β-sheet, the tertiary structure of the triose phosphate isomerase enzyme from chicken (PDB code: 8TIM), the quaternary structure of the same protein. In (a) every atom is represented by a sphere (green = carbon, red = oxygen, blue = nitrogen). In (b) only the backbone is shown. In (c) a cartoon representation of the structures, where α-helices are shown as red cylinders and β-strands as green arrows.

A protein is therefore defined by its primary (the amino acid sequence), secondary (the α-helices and β-sheets), tertiary (the folded conformation) and quaternary (set of chains) structure.

4. Protein evolution

By and large, proteins evolve by accumulating mutations (amino acid replacements, insertions and deletions) which, if not destabilizing, can be transmitted to the progeny and be fixed in the population.


Fig. 2. The twenty natural amino acids (a). The general structure of an amino acid is shown in (b). In the dipeptide shown in (c) the thicker line indicates the backbone or main chain.

At some stage of the evolution of a species, some individuals might diverge sufficiently to give rise to a different species, i.e., become unable to cross with the other members of the originating species. What happens to protein structures upon mutation? The protein’s limited stability suggests that the delicate balance between destabilizing and stabilizing forces might be easily destroyed and the protein might not be able to fold (this is indeed the most likely outcome of a random replacement of one amino acid in a laboratory experiment). But, during evolution, function has to be preserved; therefore all the proteins that we observe can only contain non-destabilizing mutations with respect to their immediate ancestor sequence. Can a small change destabilize the original protein structure and stabilize a completely different one, preserving stability, function, folding ability, etc.? This is rather unlikely, and indeed never observed. We are therefore left with one possibility: evolutionarily related proteins, that is proteins derived from a common ancestor via the accumulation of small changes, cannot but have similar structures, in which mutations have been accommodated by causing only small local rearrangements. If the number of changes, that is the evolutionary distance, is high, these local rearrangements can cumulatively affect the protein structure and produce relevant distortions, but the general architecture, that is the fold, of the protein has to be conserved [64,65].

On the other hand, if two proteins have evolved from a common ancestor, it is likely that a sufficient proportion of their sequences has remained unchanged, so that an evolutionary relationship can be deduced from their comparative analysis [66–71]. Therefore the detection of evolutionary relationships between proteins is of utmost interest in the field of protein structural and functional prediction: if we can infer that two proteins are homologous, that is evolutionarily related, information about the structure of one can be transferred to the other. As we will see, the extent to which this is possible and effective depends on the evolutionary distance between the two proteins (Fig. 3).

Fig. 3. The evolution of protein structures. Two evolutionarily related proteins: triose phosphate isomerase from C. elegans, a small worm living in soil and reaching at most 1 mm in size (left, PDB code: 1MO0), and from chicken (right, PDB code: 8TIM). Although some differences are noticeable, the overall structure is conserved.

Functional information can also be deduced from evolutionary relationships; in this case, however, some further complications arise. During evolution a gene can be duplicated; the function of one of the two copies is then redundant and not subject to any evolutionary pressure other than that of not being deleterious to the organism. This second “free” copy could evolve a new function, and therefore a common evolutionary origin does not necessarily imply functional conservation. This problem can be addressed in different ways that I will not discuss here, but it should be mentioned that it represents probably the most difficult hurdle in the path to function discovery [30].


5. The differential form of the folding problem

The folding problem can be formulated in its differential, rather than integral form: instead of asking what is the structure of a given protein sequence, we can ask what is the expected structural effect of sequence changes on a known protein structure. In general, it is not easy to model the effect of an amino acid change on a protein structure since this would imply precise energetic calculations and we have discussed the reasons why this is very difficult to do. However, if we know that the amino acid replacement, or insertion or deletion, has been accepted in the protein structure, the problem might become more tractable. Since we expect that sequence variations between two homologous natural proteins have been accommodated into the same general fold, the knowledge of the structure of one of them can give us an initial approximate model for the structure of the other onto which we can try to model the effect of the mutations. This method for predicting the structure of a protein is called, understandably, comparative or homology modelling [72–85]. We will first discuss how to detect that two proteins are homologous and then the techniques that can be employed to model sequence changes.

6. Detecting evolutionary relationships

Given two protein sequences, we can estimate the probability that they originated from a common ancestral sequence by calculating the most probable evolutionary path between them and estimating the likelihood that such a path exists. Let us assume that the most probable path is the most parsimonious one, i.e., the path that requires the minimum number of substitutions, insertions and deletions between the common ancestor and each of the sequences or, almost equivalently, between the two sequences. Under this hypothesis, the problem can be stated as follows:

Problem 1: What is the minimum number of edit operations needed to transform one protein sequence into the other, and how likely is it that such a number arises from two unrelated protein sequences?

From an algorithmic point of view, the problem becomes:

Problem 2: Given two strings in an alphabet of twenty characters, what is the minimum number of edit operations needed to transform one string into the other, and how likely is it that such a number arises from two random strings?

Although the two problems seem equivalent, they are not, because proteins are not just strings of amino acids! Indeed, the second problem can be solved exactly, while the first still represents an open issue in bioinformatics. Let us first see how we solve Problem 2, and then we will discuss the difficulties of extending the solution to Problem 1. The algorithm that we need is called dynamic programming [86,87] and we will illustrate it using the example in Fig. 4.

Fig. 4. The Needleman and Wunsch alignment algorithm [86]. (a) The scoring matrix for two sequences. The two paths in the matrix correspond to the alignments shown in (b). In (c) and (d) two partially filled cumulative matrices are shown. In the first, initial insertions and deletions are not penalized; in the second a penalty of 0.6 has been assigned. (e) The complete cumulative matrix. Arrows starting from each cell indicate which cells have been used to fill them. Note that, in some cases, there are pointers to more than one cell. In (f) and (g) two cumulative matrices for the same alignment, obtained with a penalty value of 1 and 0.2, respectively. Only pointers including cells in the optimal paths are shown.

Fig. 4a shows a matrix where each row corresponds to one character of one string and each column to one character of the other. An element of the matrix is set to 1 if the characters corresponding to its row and column are identical and 0 otherwise. A correspondence (alignment) between the two strings is a path in the matrix, as illustrated in the figure. The alignment can be global, i.e., include the whole strings, or local, including only regions of them. To find the optimal global alignment under our hypotheses, we need the correspondence between the two strings that requires the minimum number of editing operations, that is, the one in which the number of identical corresponding characters is maximum. This is equivalent to saying that we need to find the path from the upper left corner to the lower right corner that includes the largest number of 1s or, equivalently, for which the sum of the values of the cells in the path is maximum. Here we are assuming that characters are either identical (scoring 1 in our matrix) or different (scoring 0), but the reasoning can be easily extended to cases where we can assign a similarity score to each pair of characters. Also, we can penalize insertions and deletions in the alignment in view of the fact that they are less likely to occur during evolution.

Now, let us assume that we have the matrix in Fig. 4c, called the cumulative matrix, where in each cell we write the maximum score that can be achieved by any path ending in that cell. The column and row labelled “0” correspond to inserting or deleting at the beginning of either string. If we do not require the first characters of each string to be aligned, we can set the scores in the first row and first column to 0. Alternatively, we can add a penalty for shifting them, for example, subtracting 0.6 for each shifted character as shown in Fig. 4d. Calculating the value in the (1, 1) cell is trivial: a path including this cell has to include either the cell (0, 0) or the cell (1, 0) or the cell (0, 1).
The maximum score achievable by any path ending in (1, 1) is the maximum of: the value in (0, 0) + 1 (since the characters in row 1 and column 1 are identical and therefore we gain 1 by passing through the cell); the value in (0, 1) + 1 − the penalty value for an insertion (0.6 in the example); and the value in (1, 0) + 1 − the penalty value for an insertion (0.6 in the example). We will write this maximum value in the cell (1, 1) and store a pointer to the cell (0, 0), the one we used to obtain it. Once (1, 1) is filled, with the same strategy we can calculate the values of the cells (1, 2), (2, 1) and (2, 2) and so on, as shown in Fig. 4d. The maximum achievable score for a global alignment of the two strings cannot but be the value in the last cell (8, 10), and this can be obtained if we passed through the cell (7, 9), which was filled using the value in (6, 8), and so on. In other words, finding the best path only requires walking backward from the (8, 10) cell to the (0, 0) cell following the pointers.

The algorithm described above is known under the name of “Needleman and Wunsch global alignment” [86] and it can be easily extended to find the best local alignment by starting from the maximum value in the cumulative matrix and working our way in both directions until we find a 0 (Smith and Waterman algorithm [87]). In our example these two alignments coincide. The Needleman and Wunsch algorithm guarantees that one optimal path is found and we can use it to find the alignment of two protein sequences that maximizes their similarity.

Assuming that amino acids are either identical or different is a rather crude approximation of biological reality: amino acids with similar chemico-physical properties are more likely to be replaced by each other during evolution than totally different ones. We need to calculate the likelihood that two amino acids are replaced by each other during evolution. Given a set of aligned evolutionarily related proteins, we can calculate for each pair of amino acids {i, j} the frequency $f_{i,j}$ with which the two amino acids are found in corresponding positions in the alignment and the frequencies $f_i$ and $f_j$ with which the amino acids i and j occur in the sequences. If the number of aligned sequences is sufficiently high, the ratio $f_{i,j}/(f_i f_j)$ is a good estimate of the likelihood that the amino acids i and j are replaced by each other during evolution. Since we need additive scores in our cumulative matrices, we generally use the $\log_2$ of these numbers.
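The cumulative-matrix procedure described above can be sketched in a few lines of code. This is a minimal illustration rather than the exact construction of Fig. 4: it uses the textbook form of the recurrence, in which only a diagonal step collects the match score while a horizontal or vertical step only pays the gap penalty. The identity scoring and the penalty of 0.6 are the example values from the text; the function name, the pointer labels and the test strings are hypothetical.

```python
def needleman_wunsch(a, b, score=lambda x, y: 1.0 if x == y else 0.0, gap=0.6):
    """Global alignment of strings a and b.

    Returns (best_score, aligned_a, aligned_b), with '-' marking gaps.
    """
    n, m = len(a), len(b)
    # cumulative[i][j] = best score of any alignment of a[:i] with b[:j]
    cumulative = [[0.0] * (m + 1) for _ in range(n + 1)]
    pointer = [[None] * (m + 1) for _ in range(n + 1)]

    # Row and column "0": leading insertions/deletions are penalized.
    for i in range(1, n + 1):
        cumulative[i][0] = -gap * i
        pointer[i][0] = "up"
    for j in range(1, m + 1):
        cumulative[0][j] = -gap * j
        pointer[0][j] = "left"

    # Fill each cell from its three neighbours and remember which one was used.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (cumulative[i - 1][j - 1] + score(a[i - 1], b[j - 1]), "diag"),
                (cumulative[i - 1][j] - gap, "up"),
                (cumulative[i][j - 1] - gap, "left"),
            ]
            cumulative[i][j], pointer[i][j] = max(candidates, key=lambda c: c[0])

    # Walk backward from the last cell to (0, 0) following the pointers.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = pointer[i][j]
        if move == "diag":
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif move == "up":
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return cumulative[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACDEF", "ACEF"))
```

Because the scoring function is passed in as a parameter, the same sketch works unchanged with additive log-odds scores in place of the 1/0 identity scores.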


The initial alignments used to derive the frequencies should be unambiguous, and therefore between very similar sequences or regions of sequences. The values obtained for these cases can then be extrapolated to detect more distant relationships. The set of matrices proposed in 1978 by M.O. Dayhoff is known as the PAM matrices [66]. A PAM (Point Accepted Mutation) is a single amino acid substitution that has been evolutionarily accepted, i.e., transmitted to the progeny, and two sequences are at 1 PAM distance if they can be converted into each other assuming an average of 1 PAM every 100 amino acids. The PAM1 matrix is constructed by deriving the substitution frequencies from alignments between pairs of proteins at 1 PAM distance from each other. The PAM2 matrix can then be obtained by multiplying the PAM1 matrix by itself, the PAM3 by multiplying the PAM2 matrix by the PAM1 matrix and so on, iteratively. The larger the number of the matrix, the more suitable it is to detect more distant evolutionary relationships.

The BLOSUM matrices are instead derived using multiple local alignments of very conserved regions in homologous proteins [88]. They are also a series of matrices. A BLOSUM-X matrix is derived from alignments such that no sequence is more than X% identical to any other sequence in the alignment. In contrast to the PAM case, here a larger number indicates a matrix more suitable for aligning more closely related sequences. Both sets of matrices are derived from very similar protein sequences or regions of protein sequences, which therefore do not include insertions and deletions, as these are rarer events in evolution. Hence, the penalty for insertions and deletions needs to be assigned empirically [89,90]. In our simple example, a change in the penalty value from 0.6 to 1 produces a different alignment as shown in Figs. 4f and 4g.
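The two calculations described above, the log-odds score derived from observed pair frequencies and the extrapolation of a 1 PAM matrix by repeated multiplication, can be illustrated with a small sketch. It rests on several assumptions: each aligned pair is counted in both orders so that the table is symmetric, the multiplication is applied to a matrix of substitution probabilities whose rows sum to 1 (not to the log-odds scores), and the two-letter alphabet and all function names are hypothetical toy choices rather than part of any published matrix.

```python
import math
from collections import Counter


def log_odds_scores(aligned_pairs):
    """Return {(i, j): log2(f_ij / (f_i * f_j))} from aligned residue pairs."""
    pair_counts = Counter()
    residue_counts = Counter()
    for x, y in aligned_pairs:
        # Count each aligned pair in both orders so the table is symmetric.
        pair_counts[(x, y)] += 1
        pair_counts[(y, x)] += 1
        residue_counts[x] += 1
        residue_counts[y] += 1
    total_pairs = sum(pair_counts.values())
    total_residues = sum(residue_counts.values())
    scores = {}
    for (i, j), count in pair_counts.items():
        f_ij = count / total_pairs
        f_i = residue_counts[i] / total_residues
        f_j = residue_counts[j] / total_residues
        scores[(i, j)] = math.log2(f_ij / (f_i * f_j))
    return scores


def multiply(p, q):
    """Plain matrix product of two square matrices given as lists of rows."""
    size = len(p)
    return [[sum(p[i][k] * q[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def extrapolate(pam1, n):
    """Multiply a 1 PAM substitution-probability matrix by itself to reach n PAM."""
    result = pam1
    for _ in range(n - 1):
        result = multiply(result, pam1)
    return result


if __name__ == "__main__":
    # Toy data: three aligned positions over a hypothetical alphabet {L, I}.
    print(log_odds_scores([("L", "I"), ("L", "L"), ("I", "I")]))
    # Toy 1 PAM matrix: 1% chance of substitution per residue.
    print(extrapolate([[0.99, 0.01], [0.01, 0.99]], 2))
```

In the toy data the conserved pairs receive positive scores and the substituted pair a negative one, which is the behaviour an additive alignment score needs.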
The procedure described above, although with the limitations imposed by the usage of empirically derived parameters, allows us to align two sequences and calculate a score. The issue is whether the resulting score is significant, i.e., whether it is compatible with what is expected from aligning two unrelated protein sequences or whether it is indicative of an evolutionary relationship. This is especially relevant when we need to compare a protein sequence with an entire database of known protein sequences to detect all proteins evolutionarily related to our query sequence. We cannot simply compare the obtained score with the expected score for two random strings of twenty letters, as protein sequences are not random. We need to calculate the expected score for two unrelated protein sequences, but it is not easy to obtain a set of sequences certainly unrelated to the query.

Database search methods use different approaches for highlighting relationships likely to be significant [70,91,92]. The two widely used methods (FASTA and BLAST) do not perform a full-fledged global or local alignment of the query sequence with every known sequence because the size of the database is too large for this approach to be practical. They both use approximations to first exclude sequences which are unlikely to be evolutionarily related to the query. We will not discuss these approximations, both because they are implementation-related and because they do not represent, in first approximation, the main difference between the two methods, which instead lies in the statistical evaluation of the results. In both cases the distribution of scores obtained by comparing the query sequence with all sequences of the database is evaluated with respect to a distribution of the scores deemed to be “randomly expected”. In FASTA, the randomly expected distribution is estimated by repeating the search several times on subsets of the database using, as query, sequences obtained by reshuffling the original sequence. In this way, the amino acid composition (and therefore any bias in it) is taken into account in calculating the probability that a score represents a true evolutionary relationship. In BLAST the randomly expected distribution is obtained once and for all (for each substitution matrix and insertion/deletion penalty scheme) using a set of amino acid sequences with the average composition of the database. This latter method is clearly faster, but more prone to errors, especially since there are many protein sequences with unusual composition. To reduce this bias effect, BLAST does not use regions of the query proteins with unusual amino acid composition in the search, and this can sometimes mask true relationships.
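The reshuffling idea can be illustrated with a small sketch: the query is shuffled many times (which preserves its amino acid composition), each shuffled copy is scored against the same sequence, and the real score is expressed as a z-score relative to that background. This is not how FASTA is actually implemented; the score-only global alignment with identity scoring, the number of shuffles, the fixed random seed and all names are assumptions made purely for illustration.

```python
import random
import statistics


def alignment_score(a, b, gap=0.6):
    """Score-only global alignment (identity = 1, mismatch = 0, linear gap penalty)."""
    previous = [-gap * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [-gap * i]
        for j in range(1, len(b) + 1):
            match = 1.0 if a[i - 1] == b[j - 1] else 0.0
            current.append(max(previous[j - 1] + match,
                               previous[j] - gap,
                               current[j - 1] - gap))
        previous = current
    return previous[-1]


def shuffled_z_score(query, subject, n_shuffles=200, seed=0):
    """Z-score of the real alignment score against shuffled-query scores."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    real = alignment_score(query, subject)
    letters = list(query)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # same composition, scrambled order
        background.append(alignment_score("".join(letters), subject))
    mean = statistics.mean(background)
    spread = statistics.pstdev(background)
    return (real - mean) / spread if spread > 0 else float("inf")


if __name__ == "__main__":
    # Hypothetical toy peptides, not real database entries.
    print(shuffled_z_score("GSQTGNAKAVAELIAKG", "GSQTGNARRVAEQLGAR"))
```

Because the background is built from the query itself, a query with an unusual composition is compared with equally unusual shuffled sequences, which is the point made in the text about composition bias.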

7. Intermediate and profile based search methods

Evolutionary relationships are transitive: two proteins that are each evolutionarily related to a third one are themselves related, and this can be used to detect more distant relationships. One can search the database with the query sequence, collect its putative homologues and repeat the search using each of them as a query [93,94]. Another approach consists in retrieving a set of sequences related to the query, deriving a statistical model for the family of related proteins and using it to estimate the likelihood that other sequences fit the model. These approaches require the construction of a multiple sequence alignment of the proteins of the family, that is, the simultaneous alignment of several sequences from the same family. Although dynamic programming methods cannot be extended to more than a few sequences, progressive heuristic methods can be used to solve this problem and construct a statistical model of a family of related protein sequences.

There are several approaches to constructing the statistical models [70,71,95–102] and I cannot describe them in detail here. As it is widely used, I will just mention one strategy, which consists in building a “profile” of the family: each position of the multiple sequence alignment is represented by a column in a matrix with twenty rows, each corresponding to one of the twenty amino acids. Each cell contains the logarithm of the frequency of the amino acid of the row in the position corresponding to the column (Fig. 5). The probability that a new sequence belongs to the family can be estimated by computing the score with which its sequence fits the profile (Fig. 5d) and comparing it with the distribution of scores expected by chance. The very successful PSI-BLAST [103,104] method takes advantage of this approach: it first searches the database with a query sequence, then collects and aligns the related sequences found in the search, builds a profile and uses it to search the database again. The procedure can be repeated iteratively until no new sequences matching the profile are found.
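The profile strategy described in this section can be illustrated with a minimal Python sketch. It follows the convention of Fig. 5 (log2 of the frequency of each amino acid in each column, with zero counts replaced by a pseudo-count of 1), and it assumes an ungapped, equal-length comparison between the profile and the new sequence. The function names, the toy alignment and the shuffling-based background are illustrative assumptions, not the procedure of PSI-BLAST or of any specific program.

```python
import math
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudo_count=1):
    """Return one {amino acid: log2 frequency} dictionary per alignment column."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_sequences = len(alignment)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)  # gap symbols are counted but never looked up
        profile.append({
            # A count of 0 is replaced by the pseudo-count before taking the log.
            aa: math.log2((counts[aa] or pseudo_count) / n_sequences)
            for aa in AMINO_ACIDS
        })
    return profile


def score_sequence(profile, sequence):
    """Sum the profile values of the residues of an ungapped sequence."""
    if len(sequence) != len(profile):
        raise ValueError("sequence and profile must have the same length")
    return sum(column[aa] for column, aa in zip(profile, sequence))


def shuffled_scores(profile, sequence, n_shuffles=1000, seed=0):
    """Background scores obtained by shuffling the sequence (an assumed null model)."""
    rng = random.Random(seed)
    residues = list(sequence)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        scores.append(score_sequence(profile, "".join(residues)))
    return scores


# Hypothetical four-sequence alignment, used only for illustration.
toy_alignment = ["GSQT", "ASQT", "ATET", "ATET"]
toy_profile = build_profile(toy_alignment)
print(score_sequence(toy_profile, "GSQT"))  # -4.0
```

Comparing the score of the real sequence with the values returned by `shuffled_scores` gives a rough idea of whether the fit is better than expected by chance; real methods use more careful statistics and allow insertions and deletions.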

8. Building a comparative model

The most widely used method for predicting the structure of a protein is based on the detection of an evolutionary relationship between the protein of interest (target) and one or more proteins of known, experimentally determined, structure (templates). The procedure is conceptually very simple: if we assume that the sequence alignment between the target and template sequences (obtained with one of the methods described above) reflects the true evolutionary relationship between their amino acids, then we can assume that most of them have preserved the same relative position in the structure and use the coordinates of the backbone of the template(s) as a first approximation of the coordinates of the backbone of the target. We then need to model the rearrangements of the backbone caused by the amino acid differences between the two proteins. Next, insertions and deletions can be present in the alignment and need to be modelled separately as we cannot “copy”


[Figure 5, panels: (a) multiple sequence alignment of fourteen sequences (Seq 1 to Seq 14); (b) frequency values at each alignment position for the amino acids A, C, D, E, ..., R, S, T, V, Y, W; (c) the corresponding profile of log2 values; (d) the new sequence SEQ X, ASKISILYSSKTGKTERVAKLIEEG, and its profile score of −113.1.]

Fig. 5. (a) A multiple sequence alignment from which the frequency values shown in (b) can be derived. The profile shown in (c) contains the log2 of the values. Frequency values of 0 are replaced by 1 (method of the pseudo-count). Part (d) shows the score for a new sequence.


the coordinates for the former while the latter certainly produced distortions in the neighbouring amino acids. Finally, we need to model the conformation of the amino acid side chains. The first step, that is importing the coordinates of the backbone of aligned amino acids, is obviously trivial. Each of the other steps is troublesome and we do not have a reliable and general solution for any of them. The prediction community has established a worldwide experiment (CASP) for evaluating the effectiveness of methods for protein structure prediction and most of the conclusions reported below are based on the results of this experiment [105–109]. Every two years crystallographers and NMR spectroscopists who are about to solve a protein structure are asked to make the sequence of the protein available together with a tentative date for the release of the final coordinates. Predictors produce and deposit models for these proteins before the structures are made available and, finally, a panel of assessors compares the models with the structures as soon as they are available and tries to evaluate the quality of the models and to draw some conclusions about the state of the art of the different methods. The results are discussed in a meeting where assessors and predictors convene and the conclusions are made available to the whole scientific community via the World Wide Web and the publication of a special issue of the journal Proteins: Structure, Function, and Genetics. There have been CASP experiments in 1994 (CASP1), 1996 (CASP2), 1998 (CASP3), 2000 (CASP4) and 2002 (CASP5). The CASP6 experiment is ongoing. In the last CASP experiment, about 7000 comparative models were submitted [109] and therefore the results derived from their analysis can be considered of sufficient generality. A simple way to understand whether a method is able to model backbone local rearrangements is the following. 
Given the model, the template structure used to build it and the subsequently determined target structure, we can ask whether, in the aligned regions, the model is closer to the real structure than the template used to build it. In the five editions of CASP (including more than 45 000 models for 222 proteins) no method has convincingly achieved this result: no modification of the template backbone structure has produced coordinates significantly and consistently closer to the experimentally determined protein structure [110–113].

The results of the modelling of insertions and deletions are equally unsatisfactory. Here methods are usually based either on an energy driven search for the possible conformations of the region of interest or on a database search for regions of proteins of known structure that can provide a local template [72,114–117]. The latter are usually selected on the basis of either a good fit between target and local template in the regions flanking the region of interest, or local sequence similarity.

Side chain modelling is probably a less serious problem: it seems that, given the experimental backbone structure, several methods can reconstruct the side chains correctly and therefore any improvement in modelling the backbone will allow us to obtain reasonable results for the modelling of side chains [118–129].

There is also good news coming from the analysis of CASP results. First, models of targets for which a template sharing more than 30% sequence identity is available are sufficiently accurate for most practical purposes, certainly for guiding the design of experiments to obtain mutants with desired properties and for understanding general molecular mechanisms [113,130].
More distant evolutionary relationships can be used to build models that propose the correct architecture and are often instrumental, for example, in designing chimeras for biotechnological purposes and sometimes in reducing the number of possible hypotheses on the function of unknown proteins. Furthermore, the continuously growing number of sequences and structures is steadily increasing both the number of proteins that can be modelled by homology and their backbone accuracy.
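The backbone test discussed above (is the model, in the aligned regions, closer to the experimental structure than the template it was built from?) can be sketched in a few lines of Python. Note that this sketch swaps in the distance-matrix RMSD, which compares internal Cα–Cα distances and therefore needs no superposition; the CASP assessments rely on superposition-based and other measures, so this is only an illustration of the question being asked. The function names and the toy coordinates are assumptions.

```python
import math
from itertools import combinations


def distance_rmsd(coords_a, coords_b):
    """Root mean square difference between the internal distances of two structures.

    Both arguments are equal-length lists of (x, y, z) coordinates of equivalent
    (aligned) atoms, typically C-alpha atoms. The measure is independent of the
    relative orientation of the two structures, but it cannot distinguish a
    structure from its mirror image.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("the two structures must contain the same aligned atoms")
    if len(coords_a) < 2:
        raise ValueError("at least two atoms are needed")
    squared = [
        (math.dist(coords_a[i], coords_a[j]) - math.dist(coords_b[i], coords_b[j])) ** 2
        for i, j in combinations(range(len(coords_a)), 2)
    ]
    return math.sqrt(sum(squared) / len(squared))


def model_improves_on_template(model, template, target):
    """True if the model is closer to the experimental target than its template is."""
    return distance_rmsd(model, target) < distance_rmsd(template, target)


# Hypothetical three-residue example (coordinates in Angstrom), for illustration only.
target = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
template = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model = [(1.0, 1.0, 1.0), (4.8, 1.0, 1.0), (4.8, 4.8, 1.0)]  # target shifted by (1, 1, 1)
print(model_improves_on_template(model, template, target))  # True
```

In a real assessment the comparison would be restricted to the residues aligned between target and template, and repeated over many targets before drawing any conclusion about a method.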


Finally, there are two aspects of comparative modelling that make it the method of choice for protein structure prediction whenever it is applicable. First of all, the relative quality of a comparative model depends on the evolutionary distance between the two proteins. In fact, both the probability of inferring the correct alignment between two proteins and the divergence between their structures are correlated with their evolutionary distance, which can be estimated a priori. This implies that it is possible to estimate beforehand the expected quality of a comparative model and its possible range of application, and hence to decide whether it is reasonable to embark on the task. Perhaps most importantly, it also implies that one can attach a reliability to any of the conclusions derived from the analysis of the model [64,131]. The second, equally important, aspect is that the methodology will be especially effective in modelling the regions of a protein that are more conserved during evolution. This implies that functionally important regions will be modelled more accurately than other regions, which are often of lower interest [113,130].

9. The integral form of the folding problem

Although the number of known protein structures and sequences grows at an impressive rate, we still too often face the problem of having to infer structural properties of proteins for which no homologous protein of known structure is available. Evolutionarily related proteins share similarities in sequence and structure. It is however possible, and observed, that two proteins have diverged so much that the sequence signal falls below the detection level. It is equally possible, and observed, that some topologies or folds are used by proteins that presumably do not share any evolutionary relationship. Both observations allow us to reformulate the prediction problem once more. This time, rather than asking what the structure of our protein is, we can ask whether any of the known structures can represent a reasonable model for it, independently of our ability to detect an evolutionary relationship. This is equivalent to asking whether the sequence of our target protein fits any of the structures present in our database. Clearly, even if we detect such a fit, we do not expect the structures to be very similar, but the template protein can still represent an approximation of our protein structure that is sufficient for many applications. Therefore we now have to face the problem of evaluating sequence–structure fitness.

The methods can be roughly divided into two categories, although several hybrid combinations are possible. In one approach, we analyse our sequence and replace each amino acid with a symbol indicating its propensity to be observed in a given structural environment. Usually we take into account the propensity of an amino acid for being in one of the secondary structure types, for being in a hydrophobic or a hydrophilic environment and for being more or less exposed to a polar solvent [99,132–139]. This recasts our sequence as a new sequence in a different alphabet.
Whenever possible, we use a multiple alignment of all available proteins homologous to our target sequence, as we know that they will share the same (although yet unknown) structure. Next, we can analyse each of the proteins of known structure. For every position, we will not take into account which amino acid happens to be present, but rather examine the properties of the position, i.e., its secondary structure, its environment and its exposure to


the solvent and will assign a character describing the observed combination of properties. In this way our database of protein structures will be represented by a set of strings. The string representing the query sequence can now be compared with each of the strings representing the structures, using exactly the same techniques described for the detection of evolutionary relationships. Once again we will need a background distribution with which to compare the obtained score. This can be obtained by reshuffling our sequence, or by creating reasonable “decoy” structures.

The other approach, known as threading, builds as many models of the target protein as there are structures in the database (or in some reasonably selected subset of it), using each structure as a template and optimising a fitness parameter that depends upon the interactions between the amino acids in the model [140–155]. The optimisation can be achieved by a technique called double dynamic programming (more complex than dynamic programming, since this time the score in each cell of the matrix depends upon the interactions between different amino acids which, in turn, depend on the path), or by using some suitable approximations that will not be discussed here [156]. The fitness function is usually a pair-wise potential between amino acid side chains reflecting the likelihood of the observed set of interactions [157,158]. The latter is computed using the inverse Boltzmann equation [159]. In other words, we calculate the frequency of occurrence of each pair-wise interaction between amino acids in known protein structures, for each distance between the side chains and for each distance in the linear amino acid sequence. The probability of observing the interaction (approximated by its frequency) is related to its free energy difference with respect to the ground state by the inverse Boltzmann equation: E = −RT ln(po/pd).
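As a minimal sketch of how the inverse Boltzmann equation turns observed frequencies into a pair-wise potential, the Python fragment below takes po as the frequency of an interaction in known structures and pd as its frequency in a reference (decoy) set. The value of R in kcal/(mol K), the default temperature of 300 K, the pseudo-count and the way interactions are keyed (residue pair plus distance bin) are assumptions made for the example, not choices taken from the methods cited here.

```python
import math

GAS_CONSTANT = 0.0019872  # kcal/(mol K); assumed unit system for this example


def inverse_boltzmann_energy(p_observed, p_reference, temperature=300.0):
    """E = -RT ln(p_observed / p_reference)."""
    if p_observed <= 0.0 or p_reference <= 0.0:
        raise ValueError("probabilities must be strictly positive")
    return -GAS_CONSTANT * temperature * math.log(p_observed / p_reference)


def pair_potential(observed_counts, reference_counts, temperature=300.0, pseudo_count=1.0):
    """Convert interaction counts into pseudo-energies.

    Both arguments map a hypothetical key such as ("LEU", "VAL", distance_bin)
    to the number of times that interaction was seen. The pseudo-count keeps
    the logarithm defined for interactions missing from one of the two sets.
    """
    keys = set(observed_counts) | set(reference_counts)
    observed_total = sum(observed_counts.values()) + pseudo_count * len(keys)
    reference_total = sum(reference_counts.values()) + pseudo_count * len(keys)
    return {
        key: inverse_boltzmann_energy(
            (observed_counts.get(key, 0) + pseudo_count) / observed_total,
            (reference_counts.get(key, 0) + pseudo_count) / reference_total,
            temperature,
        )
        for key in keys
    }


# An interaction seen twice as often as in the reference set gets a negative
# (favourable) pseudo-energy; one seen half as often gets a positive one.
print(inverse_boltzmann_energy(0.2, 0.1) < 0.0)   # True
print(inverse_boltzmann_energy(0.05, 0.1) > 0.0)  # True
```

The fitness of a threaded model would then be the sum of these pseudo-energies over all the interactions it contains.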
In the inverse Boltzmann equation, po and pd are the probabilities of observing the interaction in the protein and in decoy structures, respectively, R is the gas constant (the molar form of the Boltzmann constant) and T the temperature. Once again, the background probability distributions can be obtained by reshuffling our sequence, or by using a set of reasonable random structures. The use of Boltzmann statistics might seem surprising, since it is based on the assumption that the distribution is maintained by the equilibrium between different transition states of each particle, while proteins have a unique stable structure. However, it has been shown [160] that the number of random sequences able to stabilize a given structural element decreases exponentially with the energy of the element and therefore that, under the hypothesis that the final structure is stable, the statistics can be applied in terms of fluctuations in the space of sequences.

Once a putative template has been selected, these methods use the same strategy as comparative modelling to build the final three-dimensional model of the target protein and hence suffer from the same problems. The fold recognition methods significantly expand the set of proteins that we can model and often allow unexpected evolutionary relationships to be highlighted, but it is much more difficult to evaluate their reliability a priori and they cannot guarantee that functional regions are more reliably predicted than the rest of the structure.

If no statistically significant hit can be found with these methods and we are left with a sequence and no template, we resort to the recently developed fragment based methods [161,162]. These are based on the following idea: the sequence of the query protein is divided into fragments of different lengths and these fragments are used to select regions (more than one) with the same or a similar sequence from proteins of known structure. The structures of the retrieved fragments are then randomly


assembled to produce a large number (above 1000) of possible models. Different stochastic procedures can be used to optimize the quality of these putative models, on the basis of a fitness function similar to that used for fold recognition methods. Stochastic procedures include Monte Carlo methods, where the initial structures are subjected to random changes that are accepted with a probability dependent upon the difference between the fitness of the original and the modified structure, and genetic algorithms. The latter are based on the concept of natural evolution and selection: the starting structures are subjected to “mutations” (i.e., changes in one of the backbone angles) or “crossing over” (exchange of substructures between the models) and their fitness is evaluated. The fitness value is then used to decide how many copies of the structure will be included in a subsequent set (generation) of structures, which will be subjected to the same procedure iteratively. These methods are under development, but the results that they have provided so far are extremely encouraging [162,163].

For completeness we will also mention that there are methods that try to predict some features of the protein structure, not necessarily its atomic coordinates, for example, the location of the elements of secondary structure or putative through-space contacts between amino acids distant in the chain. Even if they do not build a model of the protein, the information that they provide can be instrumental, for example, for selecting among putative models proposed by fold recognition or fragment based methods. The most successful secondary structure prediction algorithms are based on neural networks [164,165], automatic learning methods that rely on the availability of a set of known examples, e.g., sequences of regions for which we know the secondary structure, which can easily be derived from the database of known protein structures.
A neural network is an assembly of neurons, which we represent as nodes, each with one or more incoming connections, which we call inputs, and one outgoing connection, which we call the output (Fig. 6), the value of which is related to the inputs via some arithmetic operation. A neuron “fires” if its output is higher than a given threshold and does not fire if the output is below the threshold. We can connect several neurons so that the output of some neurons represents the input of others. Each connection has a variable “weight” and therefore a node can contribute to the input of the next neuron to a different extent. Most neural network based methods, even if their aim is to provide a prediction for a single target amino acid, use as input a segment of the protein sequence that includes the target amino acid and a predefined number N of amino acids before and after it, to take context effects into account, and use an input node for each of the amino acids of the input region. The segment is then moved along the protein sequence so that each amino acid, with the exception of the first N and the last N, will be used as a target. The value of the final output node(s) is mapped to the property we want to predict. In secondary structure prediction we will have three output nodes, each associated with a type of secondary structure (α, β, other), and the central amino acid will be assigned to the secondary structure for which the output node has the highest value, or a value higher than a predefined threshold. During the learning phase, the known examples are used as input for the neural network and the network modifies the values of the weights in order to maximize the probability that the output matches the known answer. The weights derived during the learning phase can subsequently be used to calculate the answer for an unknown example. Successful secondary structure prediction methods use as input a profile derived from the alignment of the family of the target sequence [164,165].
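The sliding-window scheme described above can be sketched as follows. The fragment only shows how windows of 2N + 1 residues are extracted, encoded and passed through a small network with one hidden layer and three output nodes; the weights are random and untrained, so the predictions are meaningless and no learning phase is shown. The one-hot encoding (twenty inputs per window position), the sigmoid activation, the layer sizes and all function names are assumptions made for illustration, and, unlike the successful methods mentioned in the text, the input here is a single sequence rather than a profile.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
STATES = ("alpha", "beta", "other")


def windows(sequence, n):
    """Yield (index of the central residue, segment) for windows of 2n + 1 residues."""
    for i in range(n, len(sequence) - n):
        yield i, sequence[i - n:i + n + 1]


def one_hot(segment):
    """Encode each residue as twenty 0/1 values (all zeros for unknown residues)."""
    vector = []
    for residue in segment:
        vector.extend(1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS)
    return vector


def layer(inputs, weights, biases):
    """Output of one layer: sigmoid of the weighted sum of the inputs of each neuron."""
    return [
        1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(row, inputs)) + bias)))
        for row, bias in zip(weights, biases)
    ]


def random_weights(n_neurons, n_inputs, rng):
    """Small random weights; a trained network would supply learned values instead."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)] for _ in range(n_neurons)]


def predict(sequence, n, hidden_w, hidden_b, output_w, output_b):
    """Assign each central residue to the state whose output node has the highest value."""
    prediction = {}
    for index, segment in windows(sequence, n):
        hidden = layer(one_hot(segment), hidden_w, hidden_b)
        output = layer(hidden, output_w, output_b)
        prediction[index] = STATES[output.index(max(output))]
    return prediction


rng = random.Random(0)
N = 2  # residues on each side of the target, so each window has 5 residues
hidden_w = random_weights(4, 20 * (2 * N + 1), rng)
output_w = random_weights(3, 4, rng)
print(predict("ASKISILYSSK", N, hidden_w, [0.0] * 4, output_w, [0.0] * 3))
```

As stated in the text, the first N and the last N residues are never at the centre of a window and therefore receive no prediction in this scheme.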


Fig. 6. Scheme of a neural network for the prediction of protein secondary structure. In the inset a single neuron is shown. The output of the neuron can be some function of the weighted sum of the input values. The weights are “learned” by the network on the basis of known examples. The input of the network is a profile and, in this example, the prediction on a residue is obtained taking into account the ten adjacent positions. Each of the connections between nodes is characterized by a weight. Note the presence of a hidden layer, that is another set of neurons connecting the input to the output. Because of their presence the number of adjustable weights in this example is 81.

The data set used in the training phase is very important, as it should contain a sufficiently diverse set of examples to allow the network to derive general properties of secondary structure segments rather than just recognizing the sequences of the training set.

Methods aiming at predicting which amino acids are in contact in a protein structure usually try to do so by analysing the variation of each position in a multiple sequence alignment and trying to detect positions that change in a correlated fashion [166,167]. For example, if two positions are always occupied by charged amino acids, different in different sequences but always of opposite charge, we can hypothesise that they are in contact in the structure. Similarly, if there are two hydrophobic residues in two different positions and we note that, in our sequence alignment, every time one of them is relatively big the other is small and vice versa, it is possible that this behaviour is driven by the fact that they are close in space and have to fill the same hydrophobic pocket in the protein. These methods can rarely predict a large set of interactions reliably, but sometimes they provide a few sufficiently reliable pairs, and these can be enough to discriminate between alternative folds for a protein sequence.
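One simple way to look for positions that change in a correlated fashion is to compute the mutual information between pairs of alignment columns. This is a swapped-in, generic measure chosen for illustration; it is not claimed to be the scoring scheme of the methods cited above, and it ignores complications such as phylogenetic bias and small sample sizes. The toy alignment and the function names are assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(column_i, column_j):
    """Mutual information (in bits) between two alignment columns."""
    n = len(column_i)
    counts_i = Counter(column_i)
    counts_j = Counter(column_j)
    counts_ij = Counter(zip(column_i, column_j))
    return sum(
        (count / n) * math.log2((count / n) / ((counts_i[a] / n) * (counts_j[b] / n)))
        for (a, b), count in counts_ij.items()
    )


def rank_column_pairs(alignment, min_separation=1):
    """Return (score, i, j) tuples for all column pairs, highest score first."""
    columns = list(zip(*alignment))
    scored = [
        (mutual_information(columns[i], columns[j]), i, j)
        for i, j in combinations(range(len(columns)), 2)
        if j - i >= min_separation
    ]
    return sorted(scored, reverse=True)


# Hypothetical alignment: columns 0 and 3 always carry opposite charges (K/E or E/K).
toy_alignment = ["KAGE", "EAGK", "KAGE", "ELGK"]
best_score, i, j = rank_column_pairs(toy_alignment)[0]
print(i, j, best_score)  # 0 3 1.0
```

A completely conserved column (column 2 in the toy example) carries no information about any other column, which reflects the fact that correlated changes can only be detected at positions that actually vary.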


10. Future directions

There are several interesting developments in the protein structure prediction field, and I am confident that more progress will become evident in the near future: the genomics projects are very demanding and there is strong pressure on computational biologists to provide faster and more reliable methods for structure prediction, as witnessed by the number of groups participating in CASP and by the attention that the scientific community pays to the results of the experiment.

The most recent developments in comparative modelling are based on the idea of constructing several models for each target protein and selecting the most likely one only at the end of the complete model building procedure [113,139,168,169]. In other words, rather than optimising each step of the procedure independently, the most successful methods funnel into each subsequent step not only the optimal but also the sub-optimal intermediate results. The final model is then chosen among the several resulting atomic structures. To some extent, this is as if all the parameters of the model building procedure (template selection, alignment quality, local templates for insertions and deletions and side chain positioning) were optimised at the same time rather than sequentially. The strategies for building several alternative models include the selection of templates not only on the basis of sequence similarity, but also on the basis of sequence–structure fitness evaluation, taking advantage of algorithms developed for fold recognition. Sometimes, the template originally selected is used to search the database of structurally related proteins in order to select folds that can be used as alternative templates.
Furthermore, both optimal and sub-optimal sequence alignments with each of the putative templates are used as the basis for model building, and additional three-dimensional models are sometimes generated by combining fragments of the models obtained. The evaluation of the final set of models, after a structure clustering step, can be based on several independent criteria such as evaluation of the local environment and of inter-residue contacts, knowledge-based pair-wise potentials, stereo-chemical quality and, in some cases, visual inspection. These methods are able to produce, by and large, better models than those obtained by the conventional step-wise modelling procedure [113], probably because it is more effective to evaluate the quality of a final, complete three-dimensional model than that of each of the intermediate results of the procedure.

One conclusion that can be drawn from the above is that the boundary between different prediction methods is becoming less and less clear-cut and that they all share one problem: the detection of the best model among a large set of putative ones. This is not an easy problem, because of all the limitations of energy based evaluation methods, but improvements in this direction would benefit all available methods for structure prediction, and it is therefore not surprising that this is the area where most computational groups are directing their efforts.

Acknowledgements

This work was supported by the University of Rome “La Sapienza”, the Faculty of Medicine, the Italian Ministry of Health and the “Compagnia di San Paolo”. I am very grateful to Alexei Finkelstein and Oxana Galzitskaya for sending me the draft of their manuscript.


References [1] Venter JC, Adams MD, Myers EW, et al. The sequence of the human genome. Science 2001;291:1304. [2] Lander ES, Linton LM, Birren B, et al. Initial sequencing and analysis of the human genome. Nature 2001;409:860. [3] Long M, de Souza SJ, Gilbert W. Evolution of the intron–exon structure of eukaryotic genes. Curr. Opin. Genet. Dev. 1995;5:774. [4] Gilbert W, Glynias M. On the ancient nature of introns. Gene 1993;135:137. [5] de Souza SJ. The emergence of a synthetic theory of intron evolution. Genetica 2003;118:117. [6] Horowitz DS, Krainer AR. Mechanisms for selecting 5 -splice sites in mammalian pre-mRNA splicing. Trends Genet 1994;10:100. [7] Herbert A, Rich A. RNA processing and the evolution of eukaryotes. Nat. Genet. 1999;21:265. [8] Nissim-Rafinia M, Kerem B. Splicing regulation as a potential genetic modifier. Trends Genet. 2002;18:123. [9] Nadal-Ginard B, Smith CW, Patton JG, et al. Alternative splicing is an efficient mechanism for the generation of protein diversity: Contractile protein genes as a model system. Adv. Enzyme Regul. 1991;31:261. [10] Boue S, Letunic I, Bork P. Alternative splicing and evolution. Bioessays: News Rev. Mol. Cellul. Develop. Biol. 2003;25:1031. [11] Roberts GC, Smith CW. Alternative splicing: Combinatorial output from the genome. Curr. Opin. Chem. Biol. 2002;6:375. [12] Graveley BR. Alternative splicing: Increasing diversity in the proteomic world. Trends Genet. 2001;17:100. [13] Stanke M, Waack S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 2003;19(Suppl. 2):II215. [14] Burge CB, Karlin S. Finding the genes in genomic DNA. Curr. Opin. Struct. Biol. 1998;8:346. [15] Gao F, Zhang CT. Comparison of various algorithms for recognizing short coding sequences of human genes. Bioinformatics 2004;20:673. [16] Chou KC. Prediction of protein signal sequences. Curr. Protein Pept. Sci. 2002;3:615. [17] Zhang MQ. Computational prediction of eukaryotic protein-coding genes. Nat. Rev. 
Genet. 2002;3:698. [18] Haussler D. Computational genefinding. Bioinformatics 1998;1998:12. [19] Brazma A, Vilo J. Gene expression data analysis. FEBS Lett. 2000;480:17. [20] Shannon W, Culverhouse R, Duncan J. Analyzing microarray data using cluster analysis. Pharmacogenomics 2003;4:41. [21] Tamames J, Clark D, Herrero J, et al. Bioinformatics methods for the analysis of expression arrays: Data clustering and information extraction. J. Biotechnol. 2002;98:269. [22] Vilo J, Kivinen K. Regulatory sequence analysis: Application to the interpretation of gene expression. Eur. Neuropsychopharmacol. 2001;11:399. [23] Mann M, Hendrickson RC, Pandey A. Analysis of proteins and proteomes by mass spectrometry. Ann. Rev. Biochem. 2001;70:437. [24] Mann M, Pandey A. Use of mass spectrometry-derived data to annotate nucleotide and protein sequence databases. Trends Biochem. Sci. 2001;26:54. [25] Andersen JS, Mann M. Functional genomics by mass spectrometry. FEBS Lett. 2000;480:25. [26] Moseley MA. Current trends in differential expression proteomics: Isotopically coded tags. Trends Biotechnol. 2001;19:S10. [27] Anfinsen CB, Harrington WF, Hvidt A, et al. Studies on the structural basis of ribonuclease activity. Biochim. Biophys. Acta 1955;17:141. [28] Shakhnovich EI, Finkelstein AV. Theory of cooperative transitions in protein molecules: I. Why denaturation of globular protein is a first-order phase transition. Biopolymers 1989;28:1667. [29] Finkelstein A. Protein structure: What is it possible to predict now?. Curr. Opin. Struct. Biol. 1997;7:60. [30] Rost B. Enzyme function less conserved than anticipated. J. Mol. Biol. 2002;318:595. [31] Iliopoulos I, Tsoka S, Andrade MA, et al. Evaluation of annotation strategies using an entire genome sequence. Bioinformatics 2003;19:717. [32] Orengo CA, Todd AE, Thornton JM. From protein structure to function. Curr. Opin. Struct. Biol. 1999;9:374. [33] Dill KA, Phillips AT, Rosen JB. 
Protein structure and energy landscape dependence on sequence using a continuous energy function. J. Comput. Biol. 1997;4:227.


[34] Finkelstein AV, Shakhnovich EI. Theory of cooperative transitions in protein molecules: II. Phase diagram for a protein molecule in solution. Biopolymers 1989;28:1681. [35] Finkelstein AV, Reva BA. Search for the stable state of a short chain in a molecular field. Protein Eng. 1992;5:617. [36] Finkelstein AV, Reva BA. Search for the most stable folds of protein chains: I. Application of a self-consistent molecular field theory to a problem of protein three-dimensional structure prediction. Protein Eng. 1996;9:387. [37] Galzitskaya OV, Finkelstein AV. Folding of chains with random and edited sequences: Similarities and differences. Protein Eng. 1995;8:883. [38] Galzitskaya OV, Ivankov DN, Finkelstein AV. Folding nuclei in proteins. FEBS Lett. 2001;489:113. [39] Leopold PE, Montal M, Onuchic JN. Protein folding funnels: A kinetic approach to the sequence-structure relationship. Proc. Natl. Acad. Sci. USA 1992;89:8721. [40] Reva BA, Finkelstein AV. Search for the most stable folds of protein chains: II. Computation of stable architectures of beta-proteins using a self-consistent molecular field theory. Protein Eng. 1996;9:399. [41] Levinthal C. Are there pathways for protein folding?. J. Chim. Phys. Phys.-Chim. Biol. 1968;65:44. [42] Locker CR, Hernandez R. A minimalist model protein with multiple folding funnels. Proc. Natl. Acad. Sci. USA 2001;98:9074. [43] Nymeyer H, Garcia AE, Onuchic JN. Folding funnels and frustration in off-lattice minimalist protein landscapes. Proc. Natl. Acad. Sci. USA 1998;95:5921. [44] Onuchic JN, Luthey-Schulten Z, Wolynes PG. Theory of protein folding: The energy landscape perspective. Ann. Rev. Phys. Chem. 1997;48:545. [45] Socci ND, Onuchic JN, Wolynes PG. Protein folding mechanisms and the multidimensional folding funnel. Proteins 1998;32:136. [46] Wolynes PG. Folding funnels and energy landscapes of larger proteins within the capillarity approximation. Proc. Natl. Acad. Sci. USA 1997;94:6170. [47] Finkelstein AV, Badretdinov YA. 
Rate of protein folding near the point of thermodynamic equilibrium between the coil and the most stable chain fold. Fold. Des. 1997;2:115. [48] Galzitskaya OV, Skoogarev AV, Ivankov DN, et al. Folding nuclei in 3D protein structures. Pac. Symp. Biocomput. 2000:131. [49] Finkelstein AV, Galzitskaya AV, Physics of protein folding, Phys. Life Rev. 2004. In press. [50] Finkelstein A, Gutin A, Badretdinov A. Perfect temperature for protein structure prediction and folding. Proteins 1995;23:151. [51] Bernstein FC, Koetzle TF, Williams GJ, et al. The Protein Data Bank: A computer-based archival file for macromolecular structures. Eur. J. Biochem. 1977;80:319. [52] Branden CI, Tooze J. Introduction to Protein Structure. New York: Garland Publishing; 1999. [53] Lesk AM. Introduction to Protein Architecture. Oxford: Oxford University Press; 2001. [54] Finkelstein AV, Ptitsyn OB. Protein Physics. London: Academic Press; 2002. [55] Chothia C, Finkelstein AV. The classification and origins of protein folding patterns. Annu. Rev. Biochem. 1990;59:1007. [56] Chothia C. Proteins. One thousand families for the molecular biologist. Nature 1992;357:543. [57] Brenner S, Chothia C, Hubbard T. Population statistics of protein structures: Lessons from structural classifications. Curr. Opin. Struct. Biol. 1997;7:369. [58] Hadley C, Jones DT. A systematic comparison of protein structure classifications: SCOP. CATH and FSSP, Structure 1999;7:1099. [59] Heger A, Holm L. Towards a covering set of protein family profiles. Prog. Biophys. Mol. Biol. 2000;73:321. [60] Holm L, Sander C. The FSSP database: Fold classification based on structure–structure alignment of proteins. Nucleic Acids Res. 1996;24:206. [61] Lo Conte L, Ailey B, Hubbard TJ, et al. SCOP: A structural classification of proteins database. Nucleic Acids Res. 2000;28:257. [62] Orengo C. Classification of protein folds. Curr. Opin. Struct. Biol. 1994;4:429. [63] Orengo CA, Pearl FM, Bray JE, et al. 
The CATH Database provides insights into protein structure/function relationships. Nucleic Acids Res. 1999;27:275. [64] Chothia C, Lesk A. The relation between the divergence of sequence and structure in proteins. EMBO J. 1986;5:823. [65] Chothia C, Lesk A. The evolution of protein structures. Cold Spring Harb. Symp. Quant. Biol. 1987;52:399.

[66] Dayhoff MO, Schwartz RM, Orcutt BC. In: Dayhoff MO, editor. Atlas of Protein Sequence and Structure, vol. 5. Washington, DC: National Biomedical Research Foundation; 1978, p. 345. [67] Edwards AWF, Cavalli-Sforza L. The reconstruction of evolution. Ann. Hum. Genet. 1963;27:105. [68] Klotz LC, Komar N, Blanken RL, et al. Calculation of evolutionary trees from sequence data. Proc. Natl. Acad. Sci. USA 1979;76:4516. [69] Lindahl E, Elofsson A. Identification of related proteins on family, superfamily and fold level. J. Mol. Biol. 2000;295:613. [70] Lipman DJ, Pearson WR. Rapid and sensitive protein similarity searches. Science 1985;227:1435. [71] Mehta PK, Argos P, Barbour AD, et al. Recognizing very distant sequence relationships among proteins by family profile analysis. Proteins 1999;35:387. [72] Sanchez R, Sali A. Advances in comparative protein-structure modelling. Curr. Opin. Struct. Biol. 1997;7:206. [73] Hilbert M, Bohm G, Jaenicke R. Structural relationships of homologous proteins as a fundamental principle in homology modeling. Proteins 1993;17:138. [74] Sali A. Modeling mutations and homologous proteins. Curr. Opin. Biotechnol. 1995;6:437. [75] Bates P, Jackson R, Sternberg M. Model building by comparison: A combination of expert knowledge and computer automation. Proteins 1997;Suppl. 1:59. [76] May A, Blundell T. Automated comparative modelling of protein structures. Curr. Opin. Biotechnol. 1994;5:355. [77] Taylor W. Protein structure modelling from remote sequence similarity. J. Biotechnol. 1994;35:281. [78] Kaden F, Koch I, Selbig J. Knowledge-based prediction of protein structures. J. Theor. Biol. 1990;147:85. [79] Tramontano A. Homology modeling with low sequence identity. Methods (San Diego, CA) 1998;14:293. [80] Moult J. Predicting protein three-dimensional structure. Curr. Opin. Biotechnol. 1999;10:583. [81] Sali A. 100 000 protein structures for the biologist. Nat. Struct. Biol. 1998;5:1029. [82] Moult J. 
The current state of the art in protein structure prediction. Curr. Opin. Biotechnol. 1996;7:422. [83] Al Lazikani B, Jung J, Xiang Z, et al. Protein structure prediction. Curr. Opin. Chem. Biol. 2001;5:51. [84] Bajorath J, Stenkamp R, Aruffo A. Knowledge-based model building of proteins: Concepts and examples. Protein Sci. 1993;2:1798. [85] Ring CS, Cohen FE. Modeling protein structures: construction and their applications. The FASEB J. 1993;7:783. [86] Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 1970;48:442. [87] Smith T, Waterman M. Identification of common molecular subsequences. J. Mol. Biol. 1981;147:195. [88] Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 1992;89:10915. [89] Altschul S. Gap costs for multiple sequence alignment. J. Theor. Biol. 1989;138:297. [90] Taylor WR. An investigation of conservation-biased gap-penalties for multiple protein sequence alignment. Gene 1995;165:GC27. [91] Altschul SF, Gish W, Miller W, et al. Basic local alignment search tool. J. Mol. Biol. 1990;215:403. [92] Pearson WR. Flexible sequence similarity searching with the FASTA3 program package. Methods Mol. Biol. 2000;132:185. [93] Taylor W, Flores T, Orengo C. Multiple protein structure alignment. Protein Sci. 1994;3:1858. [94] Park J, Teichmann SA, Hubbard T, et al. Intermediate sequences increase the detection of homology between sequences. J. Mol. Biol. 1997;273:349. [95] Karplus K, Barrett C, Hughey R. Hidden Markov models for detecting remote protein homologies. Bioinformatics 1998;14:846. [96] Gribskov M, McLachlan AD, Eisenberg D. Profile analysis: Detection of distantly related proteins. Proc. Natl. Acad. Sci. USA 1987;84:4355. [97] Rychlewski L, Jaroszewski L, Li W, et al. Comparison of sequence profiles: Strategies for structural predictions using sequence information. Protein Sci. 2000;9:232. 
[98] Thompson JD, Higgins DG, Gibson TJ. Improved sensitivity of profile searches through the use of sequence weights and gap excision. Comput. Appl. Biosci. 1994;10:19. [99] Bateman A, Birney E, Durbin R, et al. The Pfam protein families database. Nucleic Acids Res. 2000;28:263. [100] Pearson WR. Effective protein sequence comparison. Methods Enzymol. 1996;266:227. [101] Park J, Karplus K, Barrett C, et al. Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol. 1998;284:1201.

[102] Marti-Renom MA, Madhusudhan MS, Sali A. Alignment of protein sequences by their profiles. Protein Sci. 2004;13:1071. [103] Altschul SF, Madden TL, Schaffer AA, et al. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389. [104] Altschul SF, Koonin EV. Iterated profile searches with PSI-BLAST—a tool for discovery in protein databases. Trends Biochem. Sci. 1998;23:444. [105] Moult J, Pedersen J, Judson R, et al. A large-scale experiment to assess protein structure prediction methods. Proteins 1995;23:ii. [106] Moult J, Hubbard T, Bryant S, et al. Critical assessment of methods of protein structure prediction (CASP): Round II. Proteins 1997;Suppl. 1:2. [107] Moult J, Hubbard T, Fidelis K, et al. Critical assessment of methods of protein structure prediction (CASP): Round III. Proteins 1999;Suppl. 3:2. [108] Moult J, Fidelis K, Zemla A, et al. Critical assessment of methods of protein structure prediction (CASP): Round IV. Proteins 2001;Suppl. 5:2. [109] Moult J, Zemla A, Fidelis K, et al. Critical assessment of methods of protein structure prediction (CASP): Round V. Proteins 2003;Suppl. 6:334. [110] Martin AC, MacArthur MW, Thornton JM. Assessment of comparative modeling in CASP2. Proteins 1997;Suppl. 1:14. [111] Jones A, Kleywegt GJ. CASP3 comparative modeling evaluation. Proteins 1999;Suppl. 3:30. [112] Tramontano A, Leplae R, Morea V. Analysis and assessment of comparative modeling predictions in CASP4. Proteins 2001;45(Suppl. 5):22. [113] Tramontano A, Morea V. Assessment of homology-based predictions in CASP5. Proteins 2003;53(Suppl. 6):352. [114] Sibanda B, Thornton J. Beta-hairpin families in globular proteins. Nature 1985;316:170. [115] Leszczynski J, Rose G. Loops in globular proteins: A novel category of secondary structure. Science 1986;234:849. [116] Tramontano A. In: Villar HO, editor. Advances in Computational Biology, vol. 2. Greenwich: JAI Press; 1995, p. 239. 
[117] Fiser A, Do RK, Sali A. Modeling of loops in protein structures. Protein Sci. 2000;9:1753. [118] Chinea G, Padron G, Hooft R, et al. The use of position-specific rotamers in model building by homology. Proteins 1995;23:415. [119] Schrauber H, Eisenhaber F, Argos P. Rotamers: To be or not to be? An analysis of amino acid side-chain conformations in globular proteins. J. Mol. Biol. 1993;230:592. [120] Dunbrack Jr R, Karplus M. Backbone-dependent rotamer library for proteins: Application to side-chain prediction. J. Mol. Biol. 1993;230:543. [121] Ogata K, Umeyama H. Prediction of protein side-chain conformations by principal component analysis for fixed mainchain atoms. Protein Eng. 1997;10:353. [122] Keller D, Shibata M, Marcus E, et al. Finding the global minimum: A fuzzy end elimination implementation. Protein Eng. 1995;8:893. [123] Ponder J, Richards F. Tertiary templates for proteins: Use of packing criteria in the enumeration of allowed sequences for different structural classes. J. Mol. Biol. 1987;193:775. [124] Bower M, Cohen F, Dunbrack Jr RL. Prediction of protein side-chain rotamers from a backbone-dependent rotamer library: A new homology modeling tool. J. Mol. Biol. 1997;267:1268. [125] Holm L, Sander C. Fast and simple Monte Carlo algorithm for side chain optimization in proteins: Application to model building by homology. Proteins 1992;14:213. [126] Eisenmenger F, Argos P, Abagyan R. A method to configure protein side-chains from the main-chain trace in homology modelling. J. Mol. Biol. 1993;231:849. [127] Hwang JK, Liao WF. Side-chain prediction by neural networks and simulated annealing optimization. Protein Eng. 1995;8:363. [128] Dunbrack RL. Comparative modeling of CASP3 targets using PSI-BLAST and SCWRL. Proteins 1999;Suppl. 3:81. [129] Vasquez M. Modeling side-chain conformation. Curr. Opin. Struct. Biol. 1996;6:217. [130] Tramontano A. Of men and machines. Nat. Struct. Biol. 2003;10:87. [131] Pizzi E, Tramontano A, Tomei L, et al. 
Molecular model of the specificity pocket of the hepatitis C virus protease: Implications for substrate recognition. Proc. Natl. Acad. Sci. USA 1994;91:888. [132] Di Francesco V, Garnier J, Munson PJ. Improving protein secondary structure prediction with aligned homologous sequences. Protein Sci. 1996;5:106.

[133] Ouzounis C, Sander C, Scharf M, et al. Prediction of protein structure by evaluation of sequence-structure fitness: Aligning sequences to contact profiles derived from three-dimensional structures. J. Mol. Biol. 1993;232:805. [134] Wilmanns M, Eisenberg D. Inverse protein folding by the residue pair preference profile method: Estimating the correctness of alignments of structurally compatible sequences. Protein Eng. 1995;8:627. [135] Luthy R, McLachlan AD, Eisenberg D. Secondary structure-based profiles: Use of structure-conserving scoring tables in searching protein sequence databases for structural similarities. Proteins 1991;10:229. [136] Zhang KY, Eisenberg D. The three-dimensional profile method using residue preference as a continuous function of residue environment. Protein Sci. 1994;3:687. [137] Elofsson A, Fischer D, Rice DW, et al. A study of combined structure/sequence profiles. Fold. Des. 1996;1:451. [138] Bowie JU, Luthy R, Eisenberg D. A method to identify protein sequences that fold into a known three-dimensional structure. Science 1991;253:164. [139] Petrey D, Xiang Z, Tang CL, et al. Using multiple structure alignments, fast model building, and energetic analysis in fold recognition and homology modeling. Proteins 2003;53(Suppl. 6):430. [140] Lemer C, Rooman M, Wodak S. Protein structure prediction by threading methods: Evaluation of current techniques. Proteins 1995;23:337. [141] Madej T, Gibrat J, Bryant S. Threading a database of protein cores. Proteins 1995;23:356. [142] Jones D, Miller R, Thornton J. Successful protein fold recognition by optimal sequence threading validated by rigorous blind testing. Proteins 1995;23:387. [143] Marchler-Bauer A, Bryant S. Measures of threading specificity and accuracy. Proteins 1997;Suppl. 1:74. [144] Rost B, Schneider R, Sander C. Protein fold recognition by prediction-based threading. J. Mol. Biol. 1997;270:471. [145] Miller RT, Jones DT, Thornton JM. 
Protein fold recognition by sequence threading: Tools and assessment techniques. FASEB J. 1996;10:171. [146] Panchenko A, Marchler-Bauer A, Bryant SH. Threading with explicit models for evolutionary conservation of structure and sequence. Proteins 1999;Suppl. 3:133. [147] Jones DT. GenTHREADER: An efficient and reliable protein fold recognition method for genomic sequences. J. Mol. Biol. 1999;287:797. [148] Jaroszewski L, Rychlewski L, Zhang B, et al. Fold prediction by a hierarchy of sequence, threading, and modeling methods. Protein Sci. 1998;7:1431. [149] Huang ES, Subbiah S, Tsai J, et al. Using a hydrophobic contact potential to evaluate native and near-native folds generated by molecular dynamics simulations. J. Mol. Biol. 1996;257:716. [150] Yadgari J, Amir A, Unger R. Genetic algorithms for protein threading. Proc. Int. Conf. Intell. Syst. Mol. Biol. 1998;6:193. [151] David JMT, Jones T. Potential energy functions for threading. Curr. Opin. Struct. Biol. 1996;6:210. [152] Godzik A, Skolnick J. Sequence-structure matching in globular proteins: Application to supersecondary and tertiary structure determination. Proc. Natl. Acad. Sci. USA 1992;89:12098. [153] Reva BA, Finkelstein AV, Sanner MF, et al. Accurate mean-force pairwise-residue potentials for discrimination of protein folds. Proc. Natl. Acad. Sci. USA 1997:373. [154] Reva BA, Finkelstein AV, Sanner MF, et al. Residue-residue mean-force potentials for protein structure recognition. Protein Eng. 1997;10:865. [155] Mirny LA, Finkelstein AV, Shakhnovich EI. Statistical significance of protein structure prediction by threading. Proc. Natl. Acad. Sci. USA 2000;97:9978. [156] Kolinski A, Skolnick J, Godzik A. An algorithm for prediction of structural elements in small proteins. Pac. Symp. Biocomput. 1996:446. [157] Jones DT, Thornton JM. Potential energy functions for threading. Curr. Opin. Struct. Biol. 1996;6:210. [158] Sippl MJ. Knowledge-based potentials for proteins. Curr. Opin. Struct. Biol. 
1995;5:229. [159] Pohl FM. Empirical protein energy maps. Nat. New Biol. 1971;234:277. [160] Finkelstein AV, Badretdinov AY, Gutin AM. Why do protein architectures have Boltzmann-like statistics? Proteins 1995;23:142. [161] Bonneau R, Tsai J, Ruczinski I, et al. Rosetta in CASP4: Progress in ab initio protein structure prediction. Proteins 2001;Suppl. 5:119. [162] Jones DT, McGuffin LJ. Assembling novel protein folds from super-secondary structural fragments. Proteins 2003;53(Suppl. 6):480.

[163] Bradley P, Chivian D, Meiler J, et al. Rosetta predictions in CASP5: Successes, failures, and prospects for complete automation. Proteins 2003;53(Suppl. 6):457. [164] Rost B, Sander C, Schneider R. PHD—an automatic mail server for protein secondary structure prediction. Comput. Appl. Biosci. 1994;10:53. [165] Rost B, Sander C. Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol. 1993;232:584. [166] Gobel U, Sander C, Schneider R, et al. Correlated mutations and residue contacts in proteins. Proteins 1994;18:309. [167] Olmea O, Rost B, Valencia A. Effective use of sequence correlation and conservation in fold recognition. J. Mol. Biol. 1999;293:1221. [168] Ginalski K, Rychlewski L. Protein structure prediction of CASP5 comparative modeling and fold recognition targets using consensus alignment approach and 3D assessment. Proteins 2003;53(Suppl. 6):410. [169] Kosinski J, Cymerman IA, Feder M, et al. A “Frankenstein’s monster” approach to comparative modeling: Merging the finest fragments of fold-recognition models and iterative model refinement aided by 3D structure evaluation. Proteins 2003;53(Suppl. 6):369.