Protein structure prediction. Implications for the biologist

Protein structure prediction. Implications for the biologist

BiewhimW {]997 ~70, 63 i -6~6 O Soci6t6 fram,'aise de biochimie el bioh~gie mol0cukme / |.ib;cxicr. Paris Proteh structure prediction, Implications f...

2MB Sizes 18 Downloads 140 Views

BiewhimW {]997 ~70, 63 i -6~6 O Soci6t6 fram,'aise de biochimie el bioh~gie mol0cukme / |.ib;cxicr. Paris

Proteh structure prediction, Implications for the biologist G Del6age. C Blancher, C Geourjon hl.~till~l~' Of Biology amt ('hcmistrv qf Proleilts, 7. pa.~sage d, Veruorx. 69367 Lyon ~'edex 07. Framu (Received 16 July 1997: accepted 18 September i 997

Summary ~ Recent impro~ emenls in the prediction of protein secondary structure are described, particularly Illose methods using the hfformation contained int~ mulliple alignment.,,, in this respecl, the prediction accuracy has been checked and methods that take into account multiple aligm'nenls are 70g correct for a three-stale description of secondary structure. This quality ix obtained by at "lea',e-one out" procedure on a reference database of proteins slmring less than 25',"i. identity. Biological applications such as "protein domain design" and structural phylogeny are given. The biologisl's pohll t~t"view is also considered and joint prediclions arc encouraged in order to derive an amino acid based accuracy. All the tools described in this paper arc available for Ifiologisls on the Web (http/w~,w.ibcp.fr/predict.htmll. protein structure prediction / sequence analysis / protein domain design / bh~computing / S{)PMA Introduction Secondary structure prediction from protein sequences is perhaps the most studied problem in coral:rotational molecular biology. In addition to its intrinsic interest to biology, it has also become a starling point for lertiary structare prediction. Historically, the numerous methods yet available (more than 20 have been developed) were classified in Ibm" different categories: i) the statistical methods; it) the similarity based methods; iii)learning methods; and iv)empirical methods. More recently, the incorl~oration of multil:~le alignment ol' related prolein sequences into predictive methods makes this classification obsolete and today we should consider illelhods thai are based eilher Oll single sequence or on multiple alignnlenl. In Ihis i~aper, we presenl these predictive methods with their exlensious and lilt' way the biologist (obviously riot I'amiliar with these computing metl~:~ds) should integrate the information of secondary structure into his experiments.

Single sequence based methods The first developed methods were statistical methods even if the size and the sampling representativity criteria were obviously not met. The pioneering work of Chou and Fasman I I-3 ] consisted in the derivation of the frequency with which individual residues or short sequences are found in given contbrmational states (helix, sheet, turn and a-periodic). Further improvements in this method have been achieved by distinguishing internal and external [.]-sheets 141, the integration of structural class prediction 151. a profile consideration in the special case of 1~-~-~ proteins 16l. Another approach 171 consisted in the application of

in formalion theory to protein sequence and secondary st ructure. In this latter method the predictive parameters b.axe been revisited and the pair information has been added 18 I. However, statistical methods fail to improve despite the increasing size of the re l~rence database 19l. About 1() years ago similarity based prediction methods appeared I I O. il I. Briefly, each polypepfide of the query sequence is compared with all peptides with the same length in a reference database. if both peptides are similar (by using a secondary structure substitution matrix), the known observed conformation is given to the query peptide with the similarity value. The process in allowed |o continue until all l;ep|ides have been compared anti tile s,cores are sttmmed tip. The final predicted stale ix the one with the highe:4 conlbrlna+ ti()nal set,re. This strategy has been used in ,,everal methods with a 17 amino acid Ionp peplide I I 2l ai|d COlllhinetl with an aulonlaled optimization l~l°occdure l l3l. More reccnlly several groul~s have used neural networks 114=161 with slighl, significant improvenlents over previously described methods.

Incorporation of multiple alignments A promising starting point for predicting a structure for a given amino acid sequence is Io determine whether that se~ quence is sufficiently similar to any other sequence for which biophysical data (ideally a 3D structure)are avail° able. In this context, several authors have investigated the possibility of improving the prediction accuracy by taking into account the information contained into a multiple alignment of related protein sequences 117-191. This approach has been applied to both statistical and similarity based methods l:or the prediction ot' el-helices II 9l, to sell'-

~2 optimized prediction methods 1181 and to neural network methods I201. One interesting feature in the Rost and San~ r method is a quality index given at each amino acid position [20] and that considering only the residues with highest scores may yield as much as more than 86% on about 30% of residues [211, It seems now that no significant improvements in secondary structure prediction will be obtained without taking into account the effect of tertiary structure onto local folding.

Comparison of predictive success The objective comparison of predictive accuracy has been addressed for a long time, it requires at least three problems to be solved: i) the reference database; ii) calculating the prediction accuracy; and iii) the reproducibility of methods developed onto different sets, The first problem to address is the definition of a reference database [221. In this context, the cut-off value of identity between protein sequence is critical and the cut-off yet admitted is 25% [ 16] (a value below which homology modeling is rather h~ardous). Another problem is to derive the secondary structure from atomic coordinates, It has been shown that using different criteria (geometric or energetic) may result in an overall

agreement as low as 70% 123]. The firsl accuracy index is the global percentage of correctly predicted residues Q% and has been used in an objective comparison of differentt methods 191. This parameter can also be used on a state-bystate basis, but the percentage of correct prediction does not give any information about the correlation between prediction and observation, That is why a correlation parameter has been introduced and applied to secondary structure prediction [241. This C parameter varies from - I (total anticorrelation) to I (total correlation) and is nil if there is no correlation between prediction and observation. More recently, the segment overlap [251 is a measure of how the predicted segments correspond to observed segments, This is the single method that is not based on a residue-byresidue comparison. The last but not least bias that must be avoided is the independency between learning and ~esting sets for methods based on neural networks. Indeed, even if a jackknife procedure could be applied (removing the protein from learning set prior its own prediction), this procedure seems to be too time consuming to be extensively used. A comparison of different methods is given in table 1. Methods based on a single sequence have a global success rate of about 60% when checked on the 126 protein chains sharing less than 25% identity I!6]. On the other hand, methods that exploit the information contained into the

Table I. Comparison of secondary structure methods on the Rost and Sander database, States

Co(!f

GOR ! 71

Levin I ! I I

DPM 151

SOPMA 118j

Pill) ! 161

SOPMA + PHD

o~-helix

Ca act SOV Lp Lo

0,42 10,92 0,609 8,2 9, I

0.53 9,06 0,601 7,9

0.45 !0.95 0,603 6.3 9, I

11.61) 8.5 9.3 9.1 72

I).65 0.706 7,5 8(),9

0,52 8.1 5,0 5.1 66

0,61 0,74 4.0 5.3 81.3

C'cd ~

Total

Q3 (q)

53,5

9. I o2

53.9

0.61 7,9 0.737 7,3 9,1 {~9.9

SOV Lp Lo Q3 (%)

8,99 0,578 3,9 5,1 38,8

8,34 0,533 3,6 5,1 40,5

8.49 0,536 4, I 5,1 42,2

O.M) 8 0.621 3.9 5.1 53.3

Cc ¢1c SOV Lp Lo Q3 (%)

0,49 8,49 0,528 8,2 5.8 69, I

0,41 8,75 0,592 7,8 5.8 74,2

0.53 9,5 0,585 7,6 5,8 72, I

Q3% C

57.6 0,562

63, I t),563

59.9 0,566

9

0.52 8.1

-

0.702

6.9 5,8 74,6

74

0,642 4.8 6,11 82.8

68.6 t},672

70.8 -

83.7 i76) 0,678

-

I).55 -

The pr~liclive success is calculated on tlae Rost and Sander database 1161 by using the classical index reported 1251, Brielly, Q is the ~ e n t a g e of corr~tly predicted residues, C is the correlation coefficient between prediction and observation iC = -! for negative correlation, C = 0 for absence of cor~lation and C = +1 for lx~sitive correlationL and ~ is the standard deviation between predicted secondmy content, Lp is the mean predic,,ed lenglh of predicted segments, Lo is the averaged length tot observed secondary structure and SOV is the segment overlap index,

683

200

220

210 I

I

230

hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh

'

hhhhhhhhhhhhhhhhhhh~hhhhhb_hhhhh

AETDQLEEEKSAL arnier hhhhhl 11. FOS_MSVFB AETDQ

240

250

- -

I

h h'h h h h h h h h h hhhhhhhhhhh

........~ ~::

ILAAHRPACKMPEELRFSEE/AAATALDLGAP- - S

Primary struct. F¢) SX M SV FR. Prediction method Ga,'nier (parameters: - 100 Alpha Helix=39.3% =38.9% =7.8% =13.9% AA: number Score: Pr~

~SGL ~ETDKLEDEKSGLQi:EIEPI~

"4 V

"

~J

'L

nier hhhhhhhhhhhhhhhhhhh~ 15. FOSB MOUSEAETDQLEEEKAELE: 'EIAE ni or mary ond.

cons. cons.

hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh Q D A T E £! I A hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh

! A

EEL2FSEEMS2A2SDLT(I<-;LPEAS ? hhhhh?h???? ?? ~

Message box

Fig 1. Tile ,nultiple protein sequence analysis (MPSA) program interface. The alignment of sequences that belong to the Fos family of transcription factors was pc,Tormed from within the MPSA program (version 0. I for Silicon Graphics) by using the CiustalW (1.6) package and secondary st,'uctt, rc predictions come either from local impleme,'ttatiorls of well known methods 17, 8, I !] or from the Network Protein Sequence Al,:dysis (hup://www.ibcp.fr/prcdict.hlml)I ! 8 I.

tuulliplc alignment of related i,otcins SCtlUCnCC~ such as SOPMA 1181 and PHI) 116] have global success rates of abot!t 70% in the Rost and Sander database, Thus multiple alignment and consensus prediction using profiles signifi~ cantly improve the prediction accuracy of protein secondary structure. What are the future direclions lbr secondary struc° tuft? From the biologist's point of view, it seems more important to know the correctness of prediction on some given stretches rather than to get a global prediction accuracy tbr the whole sequence. That is why some authors have investigated the possibility to give an accuracy index for each residue of a sequence. Very often, this index is based on the difference between the conformational state that has the highest value and the second value 18, 201. Other authors have chosen to investigate the possibility of joint p,'ediction !18, 271 or a combination of several methods based on different principles l l3l. This stresses the necessity for the biologist to c o m p a r e the results obtained by several methods (especially based on different principles), to make consensus prediction and thus to have the methods available either integrated into a single software or through lnternet.

l n t e g r a t i o , hlto packages anvil Web servers avai|ability In order Io give |he biologisl acces~ Io diffcrenl secondary slrtlClllremethods coupled with protein sequence analysis methods, a package called A N T H E P R O T thai integrates several prediction methods has been being developed for

several years 128,291. This program runs on PC computers (DOS, Windows o,' IBM rs6000) and is available for aca ,~ demic users (ftp.ibcp.fr). The program is being adapted tot other Unix machines (SGi, SUN, IBM) and Macintosh. This new program is called MPSA (multiple protein sequence analysis) program. The main window of MPSA showing the combination of multiple alignment with different secondary structure prediction is given in figure I. In order to investigate the conformational scores for each state (helix, sheet, turn and coil), a graphical view can be displayed. The graphics are fully interactive with cursors, scrolling, zoom, resizing and selection functions. The main window can be sent to a file in either Postscript Adobe or rich text tbrmat (RTF)-Microsoft formatted files. Great efforts have been made to allow large alignments (200 sequences of 5000

~4 +),s+,+ o++(pt+++~,<++ IETU

9

IGKY

1

3ADK

6

K P H V N V G T IG H V D H G K T T L T A A I T T V L A K T Y G

SRP IVl S G P S G T G K S T L L K K L F A E Y P D S F G F S V S STT

KKS KI I F V V G G P G S G K G T Q C E K X V Q K Y G Y T H L S -

Fig 2, Not¢in sequence analysis, strategy.

elements) to b¢ handled with g(~xt inleractivity. The cumbinalion of multiple aligrlments with secondary structure predic+ ,ions ix i,tegrated in a more general approach (fig 2). Moreover. a World Wide Web server (llttp:/twwvc.ibcp.fr/ predict.html} called NPSA (network protein seqttenc¢ all~i+ lysis) has bee, develol~cd which allotvs tile hioh~gis| |o per° fi~rm secondary structu.~ analysis by several metht~ds. A ¢o,sensu~ prod,clio, i~ generated and sent It) the hio!ogi..| b)+ e°mail (due to the CPU time required for predict,tin}. Ik~ay, mo~ than 27 l~} i~lUeSts ....... ha+vet been pr,~essed by our server and the day rate is about 50 requests per day. In the fittu~, the p~,ssibility of extracting sequences by using the ACNUC 1301 web server (http:flacnuc.univ-lyonl.IW start.html) will ~ added, thus allowing to make predictions onto a Swiss+pro, subset,

Applk'alions to biological projtmls In this chapter we pl~sent three difl~rcnt examples of secondary structu~ p~diction: some have been performed in the absence of 3D data. others have been combined with 3D structural data, The secondary structure is more often conserved than the sequence since the secondary structuw code is degenerated starting i ~ m 20 amino acids to finish with three ur tk}ur different conformational states. That is why some ,steres-

5 P21

1

MTEYKLVVVGAGGVGKS~TXQLIQNHFVDEYDPT

I

NBD 1

418

QSGQTVALVGNSGCGKSTTV~LMQRLYDPTEG~SVDG

NBD2 10 61

KKGQTLALVGSSGC(~KSTVVQLLERFYDPLAGKVLLDG

Fig 3. ('ompadxtm of prhuary and second+i~+,slnicturc ctm.~crva~ tion ill |lttclcotkle bi|tding region ~motff A) of P+glycoprolcin. The nuclct~Iktc binding region A motif(lAGbx t4)-(3-KqST[)has been searched in NRI+°3Ddatabase and ftmnd into four proteins: elon,gation factor (| ETU). guanylate kinase (IGKY). adenylate khmse (3ADK }and |)21ras ~5P21) and in hun;arl P-glycoprotcin. The 3D structure wa~ deduced |'t'om their atomic ct)ordinates by using the Kabsch and Sander method 1+22i. The grayed box entire|rag the last two sequences highlights the two predicted nucleotide binding domains (motif A) from the htllna|l Pgp. Open rectangles represent ~-shccts. filled black rectangles rcprcscn| a-helices; dotted rectangles rcp|'cscttt l]-Iurns.

ring infommtion c:m be masked ill the sequence and become visible after secondary structure prediction. A typical example of lifts feature ix Birch with Ii!e Ilu¢l¢otide binding domains of multith'ug resistance proteins (P-glycoprotein). In tile nuclcotide binding domains u[ difl'~reilt proteins, tile mnino acid zunscrvation (fig 3) is very low since only tin'co am/no acids t~ti! of" ?18 arc identical i.to the six seql|en¢cs.

These muino acid+,,ct,'respmld to Ihe requisite positions G-x ~4)+G+K tor mieleotide binding. When the observed sezonda,'y 51rll¢lure,s (first t'our sequences) are compared wilh predicted st,'UclUres (last two sequences), tile structural conservatio, is obvious. This type of investigation permits to validate the multiple alignment and the location of secondary structure elements and to postulate a typical Ross,nan fold for both nucleotide binding d o m a i n s of P-glycoprotein. In the AIDS field, secondary structure prediction coupled with multiple aligmnents has been used to predict the 'most likely' sequence of itlV gpl20 131] and to propose an e×planation at the secondary structure level for the difference in the dimerization rate of HIV-I and HIV-2 reverse transcriptases (HIV-RT) enzymes 132 I. When predicted secondary structures for HIV-I RT and HIV-2 RT are compared. one can clearly identify tile main difference, located in the 312-34{} region (fig 4). Thus we formulated the hypothesis that regio, 312-34{} could explain the difference in the dimerization rate of HIV-RT subunits 132]. Then peptides in

6~5

0

100 H|V1-RT-p66

:

200

300

400

500

,~

Ii I

0 100 amino acid number

200

300 ]

400

i

500 ~x H e l i x

Difference in secondary structure

[---2

13sheet

Fill 4. Predicted secondary structure of HIV-I and HIV-2 RT. The secondary .~tructure of H|V-I and HIV-2 RT p66 subunits was prediclcd by using tile SOPM nlethod I131. Closed and open boxes correspond to the ~x-hcliccs and [:Lstrantls. respectively.

o,

these regions were synthesized. The synflletic peptide from HIV-2 clearly folds into :m ~-helix whereas the HIV-I pep|ide is rather unfolded an observed by circular dichroism (Divita, personal commtmicalion). More striking, these synthetic peptides have been found to inhibit tile sul~unil dimerizalion. Anolher example in given by the design of Ihe I'unclional protein donlain. This i'Ccelll concepl cOMes with tile possio bilily Io delcrnline the lel'liary slruclure of molecules of limited size by NMR. I)ue to the modular organization of proteins, one should be able 1o predict tIle limits of 13iolt~gi~ tally active domains. In our laboratory, this approach has been successl'ully applied to the l'ructose repressor DNA binding domain and to heal shock factor. These domains (nrainly designed l'rom sequence analysis) have kept DNA binding capabilities and, its shown in I'igure 5, their structure has been solved by NMR 1331. The last application is the use of predicted secondary

/ /r

structure as a starting point lbr tertiary structure prediction or folds recognition. We are addressing this problem by a combination of simulated annealing, database sampling, Monte-Carlo and distance geometry methods, in some particular cases (protein mostly helical), structures that dilTer by less than 3.6 A from actual structure have been obtained and these results have to be generalized to a representative subset o1' the PDB. One of tile most pronlising challenges for tile next decades is probably the understanding of the folding process and the prediction ot' protein tertiary strut'lures from sequences.

J Fig 5. An example of protein doll'lahi design. The fl'UClOst~i'¢l'~rc.',.',,or DNA-binding domain has been designed from both sequence analysis and proteolytic experiments 1331.

Ack~wledgments The ~,ork presented in this review has been carried out wilh the support of IMABIO program (CNRS) and M E N E S R granls (ACCSVI3t.

References I Chou PY. Fasnmn GD 11974) Prediction of protein ct,nformation. Biochemist~' i 3.2 ! !-245 2 Chfm P'~; Fa:~man GD {It~781 Prediction of secondary structure of W¢tlein.,i from their ~mino acid ~iequence. A~h' Enz3m,o147. 145-147 3 Foeman GD 11989~ Development of the prediction of protein ~tructore. In: Predwthm of pn~win .stlTwttu~, and the print'o~ies of protein ~otll~,~uition tFasman G, edt Plenum. New York and London. 193 p 4 Garrali RC, Taylor WR. Thornton JM 11985t The influence of tertiary ,,tructure on ~econdary structure prediction, Accessibility verstL~ preo di¢iahility tbr beta ~tructure. FEBS l.ett 188, 59--62 5 Deleage G. Roux B 119871 An algorithm for protein secondary strut. tun prediction b a ~ d on class prediction. Pt~t ling I. 289-294 6 Taykw WR. Thornton JM t 1983t Prediction of super-secondary ,,truelUre in proteins. Nutur~' 301. 540-542 7 Gamier J. O s g u t h o ~ D J. Robson B ! 197,~) Aualy.~i~ of the accuracy and implications of ~imple methods tbr predicting the secondary ~trueture of globular protein~. J Mol Biol t 20, 9 % 1211 g Giant JE Gamier J, Robins g (1987) Ftmher developments of protein ~¢ondary ~|rllClUrg prediction using information theory. New par° am¢ler~ and consideration of t~sidue pair~ J Mol Bhg 198, 425 44.a ~ Kab~ch W. Sander C 119831 Ho~. g t ~ are prediction of protein ,,ec, ond~y ,~tructure? FEfl.~ In'It 155, 179~ 182 I0 Sweet R 1I986) Evolutionnar~. similarity among peptitle ,,egment in a h~!~i~tilt prediction of pmte{n foldhtg. Bi,#l,ulvm,,r.~ 25. 1565-1577 11 l,e%in JM, Rnb.~nll B. Gamier l ! 19861 An algori|hn! lilr neCOllt!ary .~tru¢lurt~ predi¢!lon t~ased on ,eqlte!l¢¢' :.itaihlrity, FEliX Left 2115, .~1)3--3118 12 !.evill JM, (iarnlt2r J t lqt'lt'l) illlprll'.'CUt¢iits il! a scctintlary ~tfucture pr~dif!it!n !ltelht~l hll~e_d on a ~¢lti~ll tile Inca! ,~¢qi!cilco Iliti!iilll~gi¢~ t!lid i!~ !l~t' a~ a t!itildill~ tunl, ttiochiol tlii,,phv,~ A,'/,i ~155, 2~3 295 I£~ Geuuiioa C, !~qCt!ge G {!tii!4) SOPM~ l~ nell-llplillli,~ei! pledi¢liotl Illt~lh~,l ii,~' ltttllt, ill ~'¢illldiu~ ~tfil~'tttlt, !~f~tlii:liou, t~i~i !:lit/7, 154 I f~-I 14 Qhul N Sgiltov¢~ki 1'] l !~}14~ltPrediclillg ti!e ~¢¢ondltry ,~!fll¢lti!oe ill ~ht htlh!r lt.ftllt!il!~ tl~i!l 7 nTural itc,ilViitl'ks lllod¢l%, ,I ~!til I1#ol l till. @J7 7{PJ !5 ~ht!tl 8 X, l%!i)~h~Iv It{ Wi!ltl DL 11!!2} Hyhrid system itti" I!fol01tt ~.5, 104% I 1~3 t6 Ro~I B, Sa!~!er C l l~;~) P~di~:tio!! of protein secondary strt!clure ~11 ~tter than 71F;~,,accuracy. J Mot Bio1232, 584~59ti 17 Levis JM, Pascarella S, A~os It Gari~iei' l i 1993i Quantification of se¢oltdary strucluit, p~diclion inlprovenlent using multiple aligm ment. Pt~l I~'~Ig6, 849=854 .

18 Geouljon C, Driftage (i 11905) SOPMA: significanl improvements in protein secondary structure prediction by consensus prediction fi'om multiple aliginnents, Compia Appl Biosci I !, 68 !-684 19 Donnelly D, Overinglon JP, Bklndell TL ( 1994} The predic|ion and orientatmn of ~ helices from sequence alignme~:ts: then comined use of environment-dependent substituion tables. Fourier translbrm nlethcxis and helix capping rules. Pant Eng 7,645-653 211 Rest B, Sander C 119941 Combining evolutionary information and neural networks to predict protein secondary structure. Prim, ins I% 55-72 21 Yi TM, Lander ES {1993) Protein structure prediction using nearestneighbors methods J Mol Biol 232, | ! 17-1129 22 Kabsch W, Sander C ( 1983} Dictionary of protein secondary structures: pattern recognition of htdmgen-honded and geometrical features. Biopolymers 22. 2577-2637 23 Colloc'h N, Eichcbest C, Thoreau E, Henrissat B, Mouton JP t 1993l Comparison of three algoritim~s for the assignment of secondary structure in proteins: the advantage of a consensus assignment. Pt~t i:ilg 6, 377 ~82 24 Matthews BW t 1~75~ (/~mpa~i,,,~m of the predi¢led alld obsclwed se¢~ ondary structure of T4 phage lyso~ym¢, tliochooa t~,y,hyx Acla .11~5. 442-451 25 Ro.~t B, Sander C { 1994! Redefining the goals of protein secondary structure prediction. J Mol Bioi 235, 13-26 26 Bmu V. Gibrat JE Le,dn JM. Robson B, Garnier J 11988~ Secondary structure prediclion: comhination of three different methods. Peel Eng 2. 185~ 191 27 Dddage G, Roux G 119891 Use of class prediction to improve protein secondary structure prediction: Joiut prediction with methods based olt sequence honlology In: Pro,diction o,f p~lein strlwtm'e ami the !,rim'iph.x ~!flmm.iu con]i~rmali, m (Fasman G, edl Plenum, New York and London, 587 p 28 GetlurjoB C, Deldage G { 1993} Interactive and graphic coupling I~etween multiple alignments, secondary structure prediction and motiflpallern scanning into protein sequences. ()|BIOS 9, 87-9 I 2(j Geourjon ('. Ileldage G t1'1951 AN°|'II|:2P~