FRONTLINES
TIBS 21 - AUGUST 1996
Protein structure prediction: playing the fold

The field of protein structure prediction moves rapidly and in the last few years there have been a number of important developments. The accuracy of methods such as secondary structure prediction has improved; fold recognition methods have appeared; and a number of events have been organized to attempt to apply available methods and evaluate their range of application under near real-life conditions.

Recently one event has stimulated the field of structure prediction dramatically: theoreticians were challenged to predict a number of structures about to be determined experimentally. This Structure Prediction Competition [1,2] ran for most of 1994 and was followed by a meeting in December 1994 to critically assess the predictions*. A second competition has started and will run for most of this year, with an evaluation meeting in December (see URL: http://iris4.carb.nist.gov/casp2/). Already though, it has become clear that major advances have been made in secondary structure prediction and fold recognition techniques and that structure predictions made with these methods can at last be taken seriously.

Although very useful, competitions do not perfectly reflect the situation of a biologist wishing to predict the structure of a particular protein. A competition imposes limits on the targets to be predicted because they have to be chosen among those whose structure is about to be solved, only a subset of which are sequences of interest to biologists. Each competition entry also tends to be a prediction made using the particular methods of the competing laboratory rather than a prediction based on combining the results from all available methods. Finally, it is important to consider what level of accuracy and prediction detail is actually useful to an experimentalist, as well as the absolute reliability of any method.

Here we wish to draw some general conclusions for the biologist about what can be achieved with existing structure prediction methods outside the competition environment. To do this we will consider the results from a recent intensive structure prediction workshop†.

Collection, analysis, selection and prediction

For the workshop, the scientific community was asked to suggest biologically important targets whose prediction was felt to be useful, and authors of a number of successful ab initio and fold recognition methods [3-10] were invited to be teachers. In total, 112 different sequence targets were received. This is certainly not an unbiased sample of all possible targets, because we expect that sequences judged to be unpredictable are likely to be under-represented. Nevertheless, some interesting information can be derived from the analysis of these sequences. About two thirds of the targets were submitted by scientists working experimentally on the protein and a further quarter from people in contact with a group working experimentally on them. Only one tenth did not belong to either of these groups. This is a clear indication of the interest in structure predictions from the molecular biology community.

Each submitter was requested to specify the minimum level of detail at which a prediction could be useful. Interestingly, with the exception of the last group, a secondary structure prediction was often considered useful. This stresses not only the relevance of secondary structure prediction servers available to the community, but also the importance of the results of impartial evaluation being available to all. It is interesting to note that in only 17% of cases was it believed that a three-dimensional all-atom model alone would be helpful. This might reflect the impact of both correct predictions [11-14] and examples of the successful combination of approximate structural models and experimental investigation [15-17].

The targets were automatically analysed by a number of sequence analysis tools (Table I) to collect as much information as possible and to screen out any sequence that could be modelled by homology, as this technique was not covered during the workshop. All data related to submitted targets and the critically evaluated results for predictions are publicly available on the World Wide Web (WWW) at URL: http://www.mrc-cpe.cam.ac.uk/irbm-course95/. This might be considered a useful database of protein sequences for which there is an interest in obtaining predictions.

In 17% of cases, a sequence relationship with known structures was found, some of which had not been detected by the submitter. The rapidly increasing size of the structure database makes it ever more important to continuously

*Meeting on the Critical Assessment of Techniques for Protein Structure Prediction, Asilomar Conference Center, Pacific Grove, CA, USA, 4-8 December 1994. (URL: http://iris4.carb.nist.gov/).

†IRBM practical course: Frontiers of Protein Structure Prediction, Istituto di Ricerche di Biologia Molecolare (IRBM) P. Angeletti, Via Pontina Km 30.600, 00040 Pomezia, Roma, Italy, 8-17 October 1995. (URL: http://www.mrc-cpe.cam.ac.uk/irbm-course95/).

Table I. Tools used to analyse target sequences
Method | Database(s) searched | Result
BLASTP [20], FASTA [21] and SSEARCH [22] | PDB90 and refSDB (a) | Sequence similarities to sequences of known structure, with automatic listing of potential hits based on reliability cut-off for SSEARCH [22]
BLASTP [20] and FASTA [21] | Non-redundant protein sequence databases | List of sequences belonging to same family
BLOCKS [24] | BLOCKSEARCH [23] | List of matches to known protein families
Motif Search | PROSITE motif database [25] | List of matches to known protein signature sequences
GCG Pileup program [26] | Applied to sequences belonging to same family | Multiple sequence alignment
PHD server [7] | Multiple sequence alignment sent to PHD server | Secondary structure prediction
(a) T. J. P. Hubbard and S. E. B. Brenner, unpublished.

© 1996, Elsevier Science Ltd
PII: S0968-0004(96)20018-0
[Figure 1: flowchart. Step 1: the target sequence is used in a sensitive search against PDB sequences, PROSITE patterns and BLOCK sequences; a match leads to a homology model. Step 2: search sequence database, build multiple alignment, refine manually; the multiple alignment feeds distant homologues, secondary structure prediction, fold recognition methods, correlated mutations, conserved residues and β-sheet pairing. Step 3: experimental data; GLASS.]
Figure 1 Schematic guide to steps for structure prediction. Programs used in steps 1 and 2 are given in Table I. Additional programs include MPSRCH, SCANPS (G. Barton, unpublished) and HMM [27]. For multiple sequence alignment, the programs used were MaxHom [8], CLUSTALW [28] and AMPS [3]. For fold recognition, programs used were THREADER [6], ProFIT [9,29,30], MAP (G. Barton, unpublished) and Topits (B. Rost, unpublished). For ab initio predictions: of secondary structure, programs used were PHD [7] and RUNPRED (G. Barton, unpublished); of long-range interactions, the programs Correlated Mutations [10] and PREDBB [4,5] were used; and of functional residues, the program SEQUENCESPACE [31] was used. Finally, for step 3, GLASS was used (R. Leplae, T. J. P. Hubbard and A. Tramontano, unpublished).
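As an illustration only, the decision flow of Fig. 1 could be sketched in Python as follows. The Target class, its field names and the plan_prediction function are hypothetical and do not correspond to any program used at the workshop. The rule that correlated-mutation analysis is attempted only when the family has more than one member is also an assumption made for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Target:
    """Hypothetical record for one submitted sequence."""
    name: str
    matches_known_structure: bool  # outcome of the step 1 search
    family_size: int = 1           # sequences found in the step 2 search
    experimental_data: list = field(default_factory=list)


def plan_prediction(target):
    """Return the ordered analysis stages suggested by Fig. 1."""
    plan = ["step 1: sensitive search against PDB, PROSITE and BLOCKS"]
    if target.matches_known_structure:
        # A match in step 1 leads directly to homology modelling.
        plan.append("homology model")
        return plan
    plan.append("step 2: search sequence database, build and refine alignment")
    plan.append("secondary structure prediction")
    plan.append("fold recognition")
    if target.family_size > 1:
        # Assumption: these analyses need a multiple alignment.
        plan.append("correlated mutations and conserved residues")
    plan.append("step 3: combine predictions with experimental data")
    return plan


if __name__ == "__main__":
    for stage in plan_prediction(Target("example", False, family_size=20)):
        print(stage)
```

The function only orders the stages; running the actual searches and interpreting their output is the part that, as the text stresses, needs human judgement.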
search for sequence-structure relationships. Once again this stresses the usefulness of servers available to perform this task, such as scop [18], which allows a search against sequences of published structures, even if the coordinates are not available from the Protein Data Bank (PDB).

Before the workshop began, the instructors ranked the submitted sequences according to their own subjective criteria for good prediction targets, using a WWW-based voting system. This allowed 12 top-scoring targets to be selected, which, together with five targets submitted by the course participants, were analysed in detail.

The steps shown schematically in Fig. 1 are a general guide for structure prediction. In step 1, all targets were automatically checked for sequence similarity to known structures. In step 2, prediction data are collected using available fold recognition and ab initio prediction programs. The critical interpretation of these results can be problematic, as can the identification of an overall prediction that is consistent with all available data. However, the development of software that allows fold recognition, multiple sequence alignment and ab initio prediction data to be projected and manipulated in three dimensions is under way (for example, GLASS; R. Leplae, T. J. P. Hubbard and A. Tramontano, unpublished).

During the workshop the participants felt that they could make a reliable prediction at some level of detail for 11 of the 17 selected targets. In nine cases, the
level of detail of the prediction generated matched or exceeded the minimum level requested, although the accuracy of these predictions will only be established if and when experimental structures are determined. The predictions themselves are described elsewhere [19] (see also URL: http://www.mrc-cpe.cam.ac.uk/irbm-course95/).

What did we learn?

What makes a protein a good target for prediction?

It is instructive to compare the length and family size distributions of the target sequences submitted, those selected and those where predictions could be generated. The length distribution of the targets submitted is very similar to that of the non-redundant sequence database (data not shown), while the sequences selected by the teachers tend to be much shorter (Fig. 2a). The targets submitted by course participants and selected for prediction are less biased towards short sequences.

For some of the prediction methods, a large family size and a wide distribution of similarities is either a prerequisite or an advantage, and secondary structure prediction is more reliable in these cases. Accordingly, most of the targets were selected by the instructors on the basis of family size (Fig. 2b). However, once again the distribution of the targets submitted by course participants and selected for prediction is less biased towards a large family size.

For the 17 targets worked on in detail, prediction results were obtained for only half of the targets selected by the instructors, but for all of the sequences submitted by course participants. This asymmetry was not anticipated and apparently results from the different selection criteria. If it is a true result, it might be explained by the fact that, in general, the participants were working experimentally on the targets they submitted, while the teachers were not.
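The counts quoted in the text can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check: the function name and dictionary keys are hypothetical, "half" is taken to mean exactly 6 of 12, and the last two quantities rest on the assumption that the 17% of targets related to a known structure is rounded to whole sequences and that those sequences were set aside before selection. Under that reading the numbers reproduce both the 11 predicted targets and the 76 remaining cases mentioned in the article.

```python
def workshop_tally():
    """Recompute the target counts quoted in the article."""
    submitted = 112                 # sequence targets received
    by_instructors = 12             # top-scoring targets chosen by vote
    by_participants = 5             # targets brought by course participants
    selected = by_instructors + by_participants

    # "Half" of the instructor targets plus all participant targets.
    predicted = by_instructors // 2 + by_participants

    # Assumption: 17% of submissions, rounded to whole sequences.
    related_to_known_structure = round(0.17 * submitted)

    # Assumption: remaining = neither screened out nor selected.
    remaining = submitted - related_to_known_structure - selected

    return {
        "selected": selected,
        "predicted": predicted,
        "related_to_known_structure": related_to_known_structure,
        "remaining": remaining,
    }


if __name__ == "__main__":
    for key, value in workshop_tally().items():
        print(key, value)
```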
We did observe that discouraging initial results tended to lead instructors to select a new target, whereas the participants showed much more determination in achieving a prediction for their own protein (possibly out of fear of returning home to their supervisor empty-handed!). We are convinced that the important factors in a protein structure prediction problem are the exploration of all possible routes without becoming too discouraged and a thorough knowledge of the biological background of the target. These factors suggest that more predictions
will be obtained if the central figure in the prediction process is the experimentalist working on the protein rather than the theoretician. This outcome can be achieved by theoreticians making prediction methods and evaluation criteria widely available, and experimentalists making an effort to use them and integrate the results with their own experimental data. If similar determination had been applied to the remaining 76 cases, we suspect that predictions could have been made for a large proportion. The raw analysis data, on which all predictions were based, are available on the WWW-based database at URL: http://www.mrc-cpe.cam.ac.uk/irbm-course95/, and we hope that this will result in predictions for a number of those remaining non-selected targets.

Acknowledgements
We would like to express our gratitude to IRBM staff and in particular to R. Cortese for encouragement and advice; to the Information System and Technology Department for invaluable technical help; to Silicon Graphics, Q-Associates and Biosym for providing part of the hardware and software used during the workshop. We are also grateful to all the people who submitted target sequences, to all the participants, instructors and seminar speakers. We are grateful to IRBM for financial support. T. H. is grateful to the MRC and to Zeneca for financial support.
References
1 Moult, J., Pedersen, J. T., Fidelis, K. and Judson, R. (1995) Proteins 23, ii-iv
2 Shortle, D. (1995) Nat. Struct. Biol. 2, 91-93
3 Barton, G. J. and Sternberg, M. J. (1990) J. Mol. Biol. 212, 389-402
4 Hubbard, T. J. (1994) in Proceedings of the Biotechnology Computing Track, Protein Structure Prediction MiniTrack of the 27th HICSS (Lathrop, R. H., ed.), pp. 336-354, IEEE Computer Society Press
5 Hubbard, T. J. P. and Park, J. (1995) Proteins 23, 398-402
6 Jones, D. T., Taylor, W. R. and Thornton, J. M. (1992) Nature 358, 86-89
7 Rost, B. and Sander, C. (1993) J. Mol. Biol. 232, 584-599
8 Sander, C. and Schneider, R. (1991) Proteins 9, 56-68
9 Sippl, M. J. and Weitckus, S. (1992) Proteins 13, 258-271
10 Gobel, U., Sander, C., Schneider, R. and Valencia, A. (1994) Proteins 18, 309-317
11 Pearl, L. H. and Taylor, W. R. (1987) Nature 329, 351-354
12 Gibson, T. J., Postma, J. P. M., Brown, R. S. and Argos, P. (1988) Protein Eng. 2, 209-218
13 Bazan, J. F. (1990) Immunol. Today 11, 350-354
14 Madej, T., Boguski, M. S. and Bryant, S. H. (1995) FEBS Lett. 373, 13-18
15 Savino, R. et al. (1994) EMBO J. 13, 1357-1367
16 Savino, R. et al. (1994) EMBO J. 13, 5863-5870
17 Failla, C. M., Pizzi, E., De Francesco, R. and Tramontano, A. (1995) Fold. Design 1, 35-42
18 Murzin, A., Brenner, S. E., Hubbard, T. J. P. and Chothia, C. (1995) J. Mol. Biol. 247, 536-540
19 Hubbard, T. J. P. et al. (1996) Fold. Design 1, R55-R63
20 Altschul, S. F. et al. (1990) J. Mol. Biol. 215, 403-410
21 Pearson, W. R. and Lipman, D. J. (1988) Proc. Natl. Acad. Sci. U. S. A. 85, 2444-2448
22 Pearson, W. R. (1995) Protein Sci. 4, 1145-1160
23 Fuchs, R. (1993) Comput. Appl. Biosci. 9, 587-591
24 Henikoff, S. and Henikoff, J. G. (1991) Nucleic Acids Res. 19, 6565-6572
25 Bairoch, A. and Bucher, P. (1994) Nucleic Acids Res. 22, 3583-3589
26 Devereux, J., Haeberli, P. and Smithies, O. (1984) Nucleic Acids Res. 12, 387-395
27 Eddy, S. R., Mitchison, G. and Durbin, R. J. Comp. Biol. (in press)
28 Thompson, J. D., Higgins, D. G. and Gibson, T. J. (1994) Nucleic Acids Res. 22, 4673-4680
29 Floeckner, H. et al. (1995) Proteins 23, 376-386
30 Sippl, M. J. and Floeckner, H. (1996) Structure 4, 15-19
31 Casari, G., Sander, C. and Valencia, A. (1995) Nat. Struct. Biol. 2, 171-178

[Figure 2: two histograms. Panel (a) x-axis: Sequence length (amino acids). Panel (b) x-axis: Family size. Legend: All targets; Predicted (selected by teachers); Predicted (submitted by participants); Non-predicted.]
Figure 2 (a) Distribution of protein sequence length of targets. (b) Distribution of sequence family size of targets.

TIM HUBBARD AND JONG PARK
Centre for Protein Engineering, MRC, Cambridge, UK CB2 2QH.

ARMIN LAHM, RAPHAEL LEPLAE AND ANNA TRAMONTANO
Istituto di Ricerche di Biologia Molecolare (IRBM) P. Angeletti, Via Pontina Km 30.600, 00040 Pomezia, Roma, Italy.