Predicting Reliable Regions in Protein Alignments from Sequence Profiles




doi:10.1016/S0022-2836(03)00622-3

J. Mol. Biol. (2003) 330, 705–718

Michael L. Tress1*, David Jones2 and Alfonso Valencia1

1 Protein Design Group, Centro Nacional de Biotechnologia CNB-CSIC, Cantoblanco, 28049 Madrid, Spain

2 Department of Computer Science, Bioinformatics Unit, University College London, Gower Street, London, UK

For applications such as comparative modelling one major issue is the reliability of sequence alignments. Reliable regions in alignments can be predicted using sub-optimal alignments of the same pair of sequences. Here we show that reliable regions in alignments can also be predicted from multiple sequence profile information alone. Alignments were created for a set of remotely related pairs of proteins using five different test methods. Structural alignments were used to assess the quality of the alignments and the aligned positions were scored using information from the observed frequencies of amino acid residues in sequence profiles pre-generated for each template structure. High-scoring regions of these profile-derived alignment scores were a good predictor of reliably aligned regions. These profile-derived alignment scores are easy to obtain and are applicable to any alignment method. They can be used to detect those regions of alignments that are reliably aligned and to help predict the quality of an alignment. For those residues within secondary structure elements, the regions predicted as reliably aligned agreed with the structural alignments for between 92% and 97.4% of the residues. In loop regions just under 92% of the residues predicted to be reliable agreed with the structural alignments. The percentage of residues predicted as reliable ranged from 32.1% for helix residues to 52.8% for strand residues. This information could also be used to help predict conserved binding sites from sequence alignments. Residues in the template that were identified as binding sites, that aligned to an identical amino acid residue and where the sequence alignment agreed with the structural alignment were in highly conserved, high scoring regions over 80% of the time. 
This suggests that many binding sites that are present in both target and template sequences are in sequence-conserved regions and that there is the possibility of translating reliability to binding site prediction. © 2003 Elsevier Ltd. All rights reserved

*Corresponding author

Keywords: sequence alignments; profile-derived alignment scores; reliably aligned regions; binding sites; alignment quality

Abbreviations used: CASP, critical assessment of techniques for protein structure prediction; PDB, Protein Data Bank; RMS, root-mean-square. E-mail address of the corresponding author: [email protected]

Introduction

Alignments are fundamental to all database search methods and most structure prediction techniques. The detection of similarity to a structural template, as well as all subsequent predictions about the structural, functional and evolutionary features of the target sequence, stems from the initial alignment between the target sequence and the structural template. The alignments between template and target sequences are crucial for modelling, in threading, and for the prediction of secondary structure and functional residues. Poor quality alignments will link the target sequence residues to the wrong positions in the template structure and further studies based on the misaligned positions, such as 3-D homology modelling or the prediction of binding sites, will be affected. The CASP comparative modelling assessors1 recognized from the data submitted to the assessment program that the quality of alignments is still a key problem in comparative modelling. They also pointed out that even a high

0022-2836/$ - see front matter © 2003 Elsevier Ltd. All rights reserved


template-target sequence identity does not necessarily translate to good quality alignments. It is well known that as sequence similarity decreases it becomes more difficult to predict alignments accurately; Rost2 found that the quality of alignments drops off considerably below 30% of identical residues. However, the CASP assessors concluded that even at 50% identity the poor quality of some alignments may cause modelling programs to produce unsatisfactory models. Models for target structures ought to be at least as good as those resulting from the structural superposition of target and template as long as the alignments between target and template are good. In CASP4, however, the structural similarity between model and target was often much worse than that between target and template, and the assessors put this difference down to errors in the alignments.

A number of recent studies have evaluated the quality of alignments produced by fold detection methods.3–5 Jaroszewski et al.6 used RMS deviation and contact map overlap to compare the quality of the alignments produced by PSI-BLAST,7 a pairwise sequence comparison technique and two techniques that aligned profile-to-profile. The profile-to-profile techniques were found to be more accurate than the sequence-to-profile method PSI-BLAST, which in turn was more accurate than the pairwise sequence technique. Regions aligned identically by both the profile-to-profile technique and PSI-BLAST also scored considerably better than those regions aligned by just one method. Elofsson8 evaluated a range of alignment methods by comparing the models generated from alignments with the actual target structures and found that no single method or choice of parameters always produced the best alignment. It was almost always possible to produce a good alignment as long as the “right” choice of method and parameters was made.
Given that there is no prior information to suggest which are the best method and parameters to use, it was suggested that alignments could be generated with a range of methods and parameters before selecting the “best” alignment on the basis of the alignment score. Another strategy for improving structure prediction is to include only those regions where the alignment is good, while leaving out those regions of the alignment that are the most divergent. The CASP4 assessors highlighted the non-prediction of the 38 residues at the N-terminal end of Target 92 by the Venclovas group9 as a good example of this. In this region the target and closest template were highly divergent so any prediction based on the template would have been very misleading. This ignoring of non-related regions was (rightly) rewarded by the assessors. It is vital therefore to create alignments that not only reliably align evolutionarily related regions, but also that do not align unrelated stretches of sequence. It is evident from previous studies that there is

Predicting Reliably Aligned Regions

scope for improving the quality of alignments between remotely homologous proteins. One method, as suggested by Elofsson, would be to pick out the best quality alignment from an identity parade of alignments produced by different means. Another approach would be to start from alignments produced by a single method and evaluate the reliably aligned regions. The classical method for predicting reliable regions in alignments produced by a single method involves producing a range of different, sub-optimal alignments of the same pairs. Those regions of the alignment that are identically aligned in the sub-optimal alignments tend to be more reliable, while those regions where the alignment varies tend to be unreliable. Vingron & Argos10 pioneered the use of sub-optimal alignments in locating reliable regions in alignments. They showed that regions in the optimal alignment that were identically aligned in a large set of sub-optimal alignments generally agreed with the gold standard structural alignments. Chao et al.11 introduced the idea of a reliability score for aligned residues based on the degree of conservation of sub-optimal alignments and Mevissen & Vingron12 validated a modified version of this residue reliability score against a large database of structural alignments. Several recent publications have investigated ways of improving alignment quality, or at least identifying mistakes in alignments. Zhang et al.13 developed an algorithm that used residue alignment scores to separate well aligned sub-alignments from low scoring internal segments of the alignments. Jaroszewski et al.14 recognised the problems of aligning remotely related proteins, and concentrated on the means of choosing the best alignment from a set of sub-optimal alignments.
Cline et al.15 compared a range of methods including near optimal sequence alignments, the closeness of residues to secondary structural elements and a score for each residue that reflected the likelihood of amino acid residues at each position. Near optimal alignment information was the most effective at predicting good quality alignments. Schlosshauer & Ohlsson16 created a reliability index for each pair of aligned residues by re-calculating alignments with a fuzzy “winner takes most” version of Needleman & Wunsch dynamic programming17 and showed that this reliability index can be used to predict the probability that a pair of residues are correctly aligned. This work introduces another potential method for predicting reliable regions in sequence alignments. It shows that even a simple method involving multiple sequence profiles can be used to predict reliably aligned regions in remotely related proteins with a reasonably high degree of accuracy. Information from the observed frequencies of amino acids in PSI-BLAST profiles built for the template sequences was used to score each pair of aligned residues and this alignment score was used to predict regions that were aligned correctly.


The ability to predict reliable regions in alignments is enormously important, since as previously mentioned, better quality alignments will inevitably lead to better quality models. This simple method was able to predict correctly aligned positions for alignments of remotely homologous pairs that were generated by 3DPSSM,18 CLUSTALW,19 GenTHREADER,20 IMPALA21 and SAM T99.22 The accuracy of these predictions was particularly high for residues aligned in secondary structural regions. We also explored the possibility of assigning levels of reliability to binding sites in the alignments, something that would be particularly useful in sequence homology-based function prediction. Many predictions for function are made on the basis of homology with remotely related structures.23,24 In these cases, if functional predictions are to be made with confidence, information on the reliability of the alignment at the functional sites would be particularly useful. Functional residues that are reliably aligned will provide strong support for the predicted function, while if the alignment at the functional sites is shown not to be reliable, the extrapolation of function from homology will be difficult to justify. We showed that the prediction of reliable regions was even more effective for those residues defined as binding sites. In fact, for those aligned positions where the target and template amino acid residues were identical, the proportion of binding sites that were predicted as being reliably aligned was over twice that of ordinary residues. This result was repeated for all five of the methods tested.

Results

The quality of the alignments produced by the test methods was not compared, since this was not the point of the study. However, our results did concur with Elofsson8 in that no single method always produced the best alignment. The overall accuracy of the SAM T99 alignments was noticeably better than the other methods, probably because profile information from the target sequence, or a number of close homologues, is already built into the hidden Markov model of the template in the server library. Alignments from CLUSTALW were frequently less accurate than their counterparts from other methods, though this may be because CLUSTALW is a multiple alignment technique and is not meant to find relationships between remotely related pairs. Multiple alignment methods generally do not produce optimal pair-wise alignments because of problems of search space, and here the target sequence was often substantially different from the template sequence. It is also worth noting that all the alignment methods tested use some form of profiles or multiple alignments to generate their alignments and that alignment scores are automatically built into

the scoring scheme of both 3DPSSM and GenTHREADER, and both methods use the scores as part of the process of selection of their alignments, 3DPSSM as part of the overall final score and GenTHREADER as an input to a neural network. So GenTHREADER and 3DPSSM alignments were selected based in part on alignment scores derived from PSI-BLAST. Residue displacement relative to the SSAP structural alignments25 was calculated for each position in the test method alignments. Residue displacement was the number of residues between the actual position of a target residue in a test method alignment and the position of the same target residue in the SSAP alignment. For each method the residues were divided into bins based on the residue displacement and the structural designation (strand, helix or loop) read from the DSSP files.26 The results (Figure 1) showed that no matter which method was used to align the two sequences, the smoothed profile-derived alignment scores calculated for the correctly aligned residues were considerably better than they were for those residues that were aligned incorrectly. It was noticeable that correctly aligned strand residues always scored highest. The mean profile-derived alignment score for correctly aligned helix residues was always lower than the equivalent scores for strand and loop residues. The difference between the mean profile-derived alignment scores for correctly aligned strand residues and those for correctly aligned helix residues was fairly constant across all methods, ranging between 0.4 (CLUSTALW and SAM T99) and 0.52 (IMPALA). Figure 2 shows the breakdown of the profile-derived alignment scores for all the template residues aligned by one of the test methods, 3DPSSM. Residues were pre-classified as helix, loop or strand and each aligned residue was then classified by its profile-derived alignment score and its displacement relative to the SSAP structural alignment.
At each increment in the profile-derived alignment score the percentage of residues that were correctly aligned increased substantially for all structural groups, though the effect was more marked with the helix and sheet residues. The distributions were similar for the alignments produced by the other four tested methods. A higher proportion of those strand residues that were misaligned were within one or two residues of the SSAP alignment, while substantially fewer were misaligned by three or four residues. With helix residues the opposite was true: substantially more of the misaligned helix residues were shifted by three or four residues with respect to the SSAP alignment. A much higher proportion of the loop residues (43.6% compared to 29.8% of the misaligned helix residues and 24.1% of the strand residues) were gapped in the test method alignments.

Predicting reliable stretches of alignment

Given that for all of the methods tested those


Figure 1. Profile-derived alignment scores broken down by residue displacement. All positions in all test method alignments were categorised by the distance between the aligned residue in the test method alignments and the aligned residue in the SSAP structural alignments. The residues were further broken down by structural type. Mean profile-derived alignment scores are shown for each group of residues for (a) CLUSTALW, (b) GenTHREADER, (c) IMPALA, (d) SAM T99 and (e) 3DPSSM.

template residues with helical or strand-like secondary structures and scores above 6 were almost always correctly aligned, it should have been fairly simple to predict correctly aligned residues. Unfortunately, very few template residues had a profile-derived alignment score that high; for example, only 1.6% of 3DPSSM residues had a profile-derived alignment score greater than 6. However, a glance at plots of profile-derived alignment score and residue displacement relative to the structural alignment (Figure 3) shows that correctly aligned template residues are not distributed randomly and tend to be grouped as islands of consecutive correctly aligned residues. In addition, these islands tend to coincide with peaks in the profile-derived alignment scores. This suggests that if these peaks can be found, they might be useful for predicting islands of reliably aligned residues.
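The residue displacement used throughout the Results can be illustrated with a minimal sketch. It assumes, purely for illustration, that each alignment is stored as a dictionary mapping a target residue index to the template position it is aligned with (`None` for a gap); the function name, the data layout and the use of -1 for gapped residues (mirroring the negative values used for gaps in Figure 3) are assumptions, not the authors' implementation.

```python
from typing import Dict, Optional


def residue_displacement(
    test_alignment: Dict[int, Optional[int]],
    reference_alignment: Dict[int, Optional[int]],
) -> Dict[int, int]:
    """Displacement of each target residue relative to a reference alignment.

    Both arguments map target residue index -> aligned template position
    (None when the target residue is aligned to a gap). This layout is a
    hypothetical choice made for the sketch.
    """
    displacement = {}
    for target_pos, test_template_pos in test_alignment.items():
        ref_template_pos = reference_alignment.get(target_pos)
        if test_template_pos is None or ref_template_pos is None:
            # Assumed convention: gapped residues get a negative value.
            displacement[target_pos] = -1
        else:
            # Number of residues between the two aligned positions.
            displacement[target_pos] = abs(test_template_pos - ref_template_pos)
    return displacement


# Toy data, not taken from the paper.
test = {1: 10, 2: 11, 3: 13, 4: None}
ssap = {1: 10, 2: 11, 3: 12, 4: 14}
shifts = residue_displacement(test, ssap)
```

A displacement of 0 corresponds to a correctly aligned residue in the sense used here, that is, one placed identically in the test method alignment and the structural alignment.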

High-scoring regions, as described in Materials and Methods, were defined as those regions that contained at least two consecutive template residues with a profile-derived alignment score above 4 and extended in both directions along the alignment until the profile-derived alignment score fell below 2. High-scoring regions were selected in this way for all the alignments produced by all five methods tested. For three methods (CLUSTALW, SAM T99 and 3DPSSM) the approach was able to predict, with 94.95% accuracy or greater, which helix and strand residues were reliably aligned (Figure 4). Over 92% of the loop residues predicted as reliably aligned were also correct. The coverage ranged from 41–48% for strand residues to 32–38% for helix residues. For GenTHREADER and IMPALA the predictions agreed slightly less often with the structural alignments, though the coverage of the


Figure 3. Alignment score and residue displacement plotted against residue position. Profile-derived alignment score (Alignment Score, black) and residue displacement (Distance in red) for each aligned residue were plotted against increasing residue number, starting from the N-terminal end of the alignment. Negative residue displacement indicates the position of gaps in the alignment. The charts show (a) the CLUSTALW alignment of the pair 2rspB and 1fmb, and (b) the SAM T99 alignment of 1mil and 1lkkA.

Figure 2. The distribution by profile-derived alignment score and residue displacement. The distribution of displaced residues at each profile-derived alignment score bin is shown for (a) strand residues, (b) helix residues and (c) loop residues for 3DPSSM alignments. All residues were grouped by profile-derived alignment score and by residue displacement relative to the SSAP structural alignment.

regions predicted to be reliable was higher. GenTHREADER and IMPALA tended to have higher profile-derived alignment scores than the other three methods, and the difference between the mean profile-derived alignment scores of correctly and incorrectly aligned residues was not as great. Prediction accuracy for GenTHREADER and IMPALA could be improved by employing different parameters in the calculation of the high scoring regions, but only at the expense of the coverage of the predictions. IMPALA generates the highest scoring alignment between a target sequence and a PSI-BLAST profile of the structural template. This profile of the structural template is the same one that is used to calculate the profile-derived alignment scores, so, in general, IMPALA alignments will score more highly, sometimes even when the residues are aligned incorrectly. This will mean that it is more difficult to distinguish correctly predicted reliably aligned regions from those predictions


The reliability of binding sites

Figure 4. Prediction accuracy and coverage for reliably aligned regions. Reliably aligned regions were predicted from high scoring regions of profile-derived alignment score as in the methods. Results are shown for (a) the percentage of all correctly aligned residues that were predicted to be reliable (% coverage), and (b) the percentage accuracy of the predictions. Results are shown for all five alignment methods and broken down by secondary structural designation.

that are incorrectly predicted as reliable. Despite this obvious flaw, the approach to the prediction of reliable regions works surprisingly well, at the expense of a clear increase in false positives. As a comparison we also evaluated reliable regions in PSI-BLAST alignments. In contrast to IMPALA, PSI-BLAST uses a profile of the target sequence in generating its alignment, so the alignments were not evaluated using the same profile. In this case the coverage and accuracy of predicted reliable regions were almost identical to the coverage and accuracy of reliable regions predicted for CLUSTALW. The accuracy for predictions made for PSI-BLAST alignments was 2.4–3.5% better than it was for IMPALA alignments, while the coverage decreased by just 2.3–4.4% depending on the secondary structure type.
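The selection of high-scoring regions described in this section (a seed of at least two consecutive residues scoring above 4, extended in both directions until the score falls below 2) and the accuracy and coverage measures of Figure 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the scores are taken to be the per-residue profile-derived alignment scores in alignment order, and a residue scoring exactly at the extension cut-off is assumed to stay inside the region.

```python
from typing import List, Sequence, Tuple


def high_scoring_regions(
    scores: Sequence[float],
    seed_cutoff: float = 4.0,
    extend_cutoff: float = 2.0,
    min_seed: int = 2,
) -> List[bool]:
    """Flag residues that fall inside a high-scoring region."""
    n = len(scores)
    reliable = [False] * n
    i = 0
    while i < n:
        if scores[i] > seed_cutoff:
            # Find the end of the run of consecutive residues above the seed cut-off.
            j = i
            while j < n and scores[j] > seed_cutoff:
                j += 1
            if j - i >= min_seed:
                start, end = i, j - 1
                # Extend in both directions until the score falls below the cut-off.
                while start > 0 and scores[start - 1] >= extend_cutoff:
                    start -= 1
                while end < n - 1 and scores[end + 1] >= extend_cutoff:
                    end += 1
                for k in range(start, end + 1):
                    reliable[k] = True
            i = j
        else:
            i += 1
    return reliable


def accuracy_and_coverage(
    reliable: Sequence[bool], correct: Sequence[bool]
) -> Tuple[float, float]:
    """Accuracy: % of predicted-reliable residues that are correctly aligned.
    Coverage: % of correctly aligned residues that were predicted reliable."""
    predicted = sum(reliable)
    true_positives = sum(1 for r, c in zip(reliable, correct) if r and c)
    total_correct = sum(correct)
    accuracy = 100.0 * true_positives / predicted if predicted else 0.0
    coverage = 100.0 * true_positives / total_correct if total_correct else 0.0
    return accuracy, coverage


# Toy scores and correctness flags, not taken from the paper.
scores = [1.0, 2.5, 4.2, 5.1, 3.0, 1.5, 4.5, 1.0]
correct = [True, True, True, True, False, False, True, False]
reliable = high_scoring_regions(scores)
accuracy, coverage = accuracy_and_coverage(reliable, correct)
```

In the toy example the single residue scoring 4.5 is not selected, because a seed needs at least two consecutive residues above the cut-off. Raising either cut-off trades coverage for accuracy, which is the trade-off noted above for GenTHREADER and IMPALA.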

Profile-derived alignment scores were calculated for all the alignments containing template structures with RCSB PDB-defined27 binding sites. The residues in each alignment were divided into sites and non-sites, and according to whether or not the residue was aligned correctly in relation to the SSAP alignment. A mean profile-derived alignment score was calculated for each group. The results (Figure 5) show that the mean profile-derived alignment scores for correctly aligned binding sites were considerably better than they were for correctly aligned non-binding site residues. The result was repeated for all alignment methods (the difference ranged from 0.83 of a point for CLUSTALW alignments to 0.94 for SAM T99 and 3DPSSM alignments), confirming that the binding sites, and the sequence around binding sites, tend to be more conserved than ordinary residues. With the exception of SAM T99, the profile-derived alignment scores for misaligned binding sites were noticeably worse than they were for misaligned residues that were not binding sites, though the number of misaligned sites was low, particularly with SAM T99 (where there were just 42 misaligned binding site residues). The total number of binding sites evaluated ranged from 481 with the 3DPSSM alignments (4.2% of the total residues for those pairs that have binding sites) to 801 from the CLUSTALW alignments (3.9% of the total residues). Depending on the method, between 62% and 93% of the binding site residues were aligned identically in the SSAP structural alignments. If both the target and template sequences do possess equivalent binding sites, the residue at the binding site is most likely to be the same amino acid. In fact over 50% of binding sites that are aligned identically by both sequence and structure

Figure 5. Profile-derived alignment score for binding sites and non-sites. The chart shows the average profile-derived alignment scores for those binding sites (sites) and non-sites (rest) that were identically aligned with the SSAP structural alignments (correct) and for those binding sites and non-sites where the SSAP alignment differed (wrong).


methods have the same amino acid residue. This is a good indication that binding sites that are predicted to be reliably aligned may be involved in binding in both the template and target sequences. As a comparison, amino acid residues at binding sites were identical less than 14% of the time in those cases where the alignment between test method and structural alignment differed. If just those aligned residue pairs with identical template and target amino acid residues are considered, the profile-derived alignment score for correctly aligned binding sites is even greater. The profile-derived alignment score for those correctly aligned residues that are defined as binding sites and that are aligned to the same amino acid residue in the test method alignment averages between 4.1 (CLUSTALW) and 4.51 (IMPALA). The scores are more than a point greater than those for correctly aligned residues with identical amino acid residues that are not binding sites.

Predicting reliably aligned binding sites

This time high-scoring regions were defined as those regions that contained at least two consecutive template residues with a profile-derived alignment score above 3.25, and the high-scoring regions were extended in both directions along the alignment until the profile-derived alignment score fell below 2.75. These were the cut-offs that produced the best results among the five test methods. High-scoring regions of the alignment were selected for all the alignments produced by all five tested methods. Predictions were made for those residues defined as binding sites (between 481 and 801 residues depending on the method). Using this calculation for all binding sites, it was possible to predict which binding sites were aligned correctly against the structural alignments with 95.4% to 99.1% accuracy and between 63% and 67% coverage. The corresponding figures for all residues were 91% to 96.8% accuracy and between 39% and 46.8% coverage.
When the same calculation was done for just those binding sites where the aligned residue was identical (between 211 and 318 residues depending on the alignment method), at least 80% of the binding sites (and as many as 88.3% of the binding sites aligned by IMPALA) were predicted as reliably aligned (Figure 6). Those binding sites that were aligned identically by both SSAP and the test methods were predicted with at least 97.4% accuracy. For CLUSTALW and SAM T99 the accuracy was greater than 99%. As a comparison, for those residues aligned with identical amino acid residues that were not binding sites, between just 27.8% and 43.1% of the aligned residues were predicted as being reliably aligned. The numbers of residues involved in the prediction of reliably aligned binding sites were low, but the improvement in the accuracy and coverage of


Figure 6. Coverage and accuracy of predictions of reliably aligned regions, for only those residue pairs where identical amino acid residues were aligned. C1 is the percentage of binding sites that were predicted as reliably aligned for each method, while A1 shows the accuracy of those predictions against the SSAP structural alignments. C2 shows the proportion of non-binding site residues that were predicted as reliably aligned and A2 indicates the accuracy of the predictions for non-binding sites.

predictions with respect to non-binding site residues was consistent for all five methods. One template – target pair was the short-chain dehydrogenase/reductases, 1ybvA (trihydroxynaphthalene reductase) and 1a4uA (alcohol dehydrogenase). There are 23 binding sites defined for the template chain (1ybvA; Figure 7), 11 for the active site and 12 that bind NADPH. One residue is recorded as a binding site for both (Tyr178). The RCSB PDB file of the target protein (1a4uA), by contrast, only records three residue binding sites

Figure 7. The structure of trihydroxynaphthalene reductase A chain (1ybvA). The helices are shown in pink, sheet residues in light blue. The residues involved in active site binding are red and those involved in binding the cofactor NADPH are shown in royal blue.


for the active site and four that are used in the binding of NAD/NADP. In the GenTHREADER alignment all three PDB-defined active site residues from alcohol dehydrogenase align with binding sites from the template protein. Two of them align with identical amino acid residues (Tyr178 with Tyr151, and Lys182 with Lys155). In these two cases the SSAP alignment is identical and the profile-derived alignment score is high (4.98 and 5.78). However, GenTHREADER aligns the third target active site of the alcohol dehydrogenase catalytic triad (Ser138) with the binding site Ile165 from the trihydroxynaphthalene reductase active site. Not only is the amino acid residue different in this case, but the residue profile-derived alignment score is very low (−1.6) and the two sequences are almost certainly misaligned here due to the reluctance of GenTHREADER to insert a gap. SSAP does insert a single gap and aligns Ser138 from 1a4uA with another binding site from the template 1ybvA, Ser164, resulting in a residue profile-derived alignment score of 3.95. This is almost certainly the correct alignment. It is worth being cautious with the definitions of sites in the RCSB PDB, however. According to the PDB file of 1ybv, Tyr178 (equivalent to Tyr151) is supposed to be both an active site residue and an NADPH binding site in trihydroxynaphthalene reductase, while Lys182 (equivalent to Lys155 in alcohol dehydrogenase) is supposed to bind only the coenzyme. This seems to be an error in the file: the two residues do not contribute directly to coenzyme binding; these two residues plus Ser138 (to use the numbering from alcohol dehydrogenase) in fact form part of the active site in all short-chain dehydrogenase/reductases.28 Filling et al.29 proposed the involvement of a fourth conserved residue (Asn111) in catalysis. GenTHREADER aligns Asn111 to the equivalent residue (Asn138) from the template 1ybvA, as does the SSAP structural alignment.
The profile-derived alignment score for the residue is 4.25 and the surrounding residues also score highly, so if we already knew that this residue was involved in the active site in trihydroxynaphthalene reductase, we could surmise that it was also involved in the active site of alcohol dehydrogenase. In the GenTHREADER alignment, only one of the four alcohol dehydrogenase NAD/NADP binding sites aligns with a binding site in the template (Asp37 from the target with Ala61 from the template). This does not agree with the SSAP alignment either and also has a low profile-derived alignment score (−1.9), so it is unlikely to be a binding site in alcohol dehydrogenase. The Gly-X-Gly motif at residues 16–18 is part of the nucleotide-binding site in both proteins28 and the GenTHREADER alignment agrees with the SSAP alignment for this motif. The profile-derived alignment scores for the residues in the motif are 4.68, 5.93 and 4.70, which would suggest, if we knew nothing about alcohol dehydrogenase, that


they too are aligned correctly and are probably involved in binding the coenzyme as they are in trihydroxynaphthalene reductase. The other numerous residues involved in coenzyme binding in the two proteins are not identical, possibly because trihydroxynaphthalene reductase binds NADPH while alcohol dehydrogenase binds NAD/NADP.

Can profile-derived alignment score be used to predict alignment quality?

While residue profile-derived alignment scores can be used to predict reliable regions in alignments between pairs of sequences, the total profile-derived alignment score for an alignment (calculated from the sum of the residue profile-derived alignment scores) does not seem to be sufficient on its own to predict alignment quality. However, this total profile-derived alignment score can be used as a rough indicator of alignment quality, as the plot of total profile-derived alignment score and alignment quality for CLUSTALW shows (Figure 8). The correlation coefficient between alignment quality and total profile-derived alignment score per pair was 0.85.

GenTHREADER total profile-derived alignment scores

Work was done on GenTHREADER alignments to investigate those cases where total profile-derived alignment score and alignment quality did not correlate well. Many of these differences came about because while SSAP based its alignments on 3-D positioning, the alignments produced by GenTHREADER were influenced by evolutionary features. The bias of the composition of the PSI-BLAST profiles used to score the profile-derived alignment scores also played a role in creating discrepancies between alignment quality and profile-derived alignment score.

Figure 8. Plot of profile-derived alignment score and alignment quality for CLUSTALW. Alignment quality (measured as the percentage of aligned residues that are identically aligned in the SSAP alignment) is plotted against the total profile-derived alignment score for each pair of sequences aligned by CLUSTALW.
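The quantities compared in Figure 8 can be illustrated with a minimal sketch. It assumes that an alignment is stored as a mapping from template residue index to target residue index and that the reference mapping holds only the confidently aligned SSAP positions; the function names and the choice of denominator (template residues aligned in both alignments) are illustrative assumptions, not the authors' implementation.

```python
from math import sqrt
from typing import Dict, Sequence


def alignment_quality(test: Dict[int, int], reference: Dict[int, int]) -> float:
    """Percentage of residues aligned identically in the test and reference alignments.

    Both arguments map template residue index -> target residue index.
    Only template residues aligned in both alignments are counted (an assumption).
    """
    shared = [t for t in test if t in reference]
    if not shared:
        return 0.0
    same = sum(1 for t in shared if test[t] == reference[t])
    return 100.0 * same / len(shared)


def total_score(residue_scores: Sequence[float]) -> float:
    """Total profile-derived alignment score: the sum of the residue scores."""
    return sum(residue_scores)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

Computing `alignment_quality` and `total_score` once per pair and passing the two resulting lists to `pearson` gives a per-method correlation of the kind reported for CLUSTALW.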


In those cases where alignment quality was good but the profile-derived alignment score was surprisingly low, it was often a feature of the score calculation that led to the discrepancy. Poor profile-derived alignment scores could be put down to a number of factors. For example, template PSI-BLAST profiles that were calculated from just a few, very similar sequences (e.g. 1tiiD-3chbD) would score identical sequences well, while remotely related sequences would score poorly, however good the alignment quality. There were also instances where SSAP correctly left unrelated units/domains or regions unaligned, while GenTHREADER did (erroneously) align the non-homologous regions. Here the total profile-derived alignment score from the test method alignment suffered because of the poorly aligned regions, but the alignment quality, as measured against the SSAP alignment, was not affected (e.g. 1a87-1cii, 1wit-1tlk and 1iakA-1iakB). This effect was particularly noticeable if the erroneously aligned region was relatively large with respect to the whole alignment.

There were a number of cases where the profile-derived alignment scores were good, but the alignment quality surprisingly poor. Pairs with secondary structural repeats (e.g. 1bhe-1a3h, 1lxa-1thjA) often confused alignment measurement. Unusually, while most SSAP structural alignments had better profile-derived alignment scores than their test method counterparts, SSAP alignments for these repeat structures all scored worse than the equivalent test method alignments. Here the alignment scores were high for the GenTHREADER alignments while the score for alignment quality approached zero.

SSAP also occasionally aligned unrelated regions merely because of coincidental 3-D proximity. This happened when SSAP was only able to align one of two (or more) pairs of domains/units in 3-D space, while the other domain or unit was unalignable because of a kink in the joining chain (such as with 1iakB-1agdA and 4enl-1mucA). On occasions SSAP also produced casual alignments between two entirely unrelated regions (e.g. 1aoeA-1vdrA, 1bcfA-1ryt). In those rare cases where these casually aligned regions contained strong residues, GenTHREADER would be penalised for “misaligning” residues that were incorrectly aligned by SSAP in the first place.

In some pairs SSAP aligned whole secondary structural elements differently to GenTHREADER (particularly helices, which were clearly shifted by three or four residues). In these cases (such as the pairs 1a0p-1ae9A, 1cb2A-1tml, and 1awd-2pia) the SSAP alignment often looked poor and was characterised by lots of gaps. This particular difference in opinion between sequence and structural alignment occurred when regions that had high sequence similarity did not superimpose due to some conformational difference between target and template structures. One such example was the SSAP alignment of 1fsz and 1tubA. The fifth helix of the target protein 1fsz was shifted by four residues in alignments produced by all five test methods. The test methods aligned the helix differently to the SSAP alignment because of the high sequence similarity at the start of the helix (GGGTGTG… versus GGGTGSG…). In the 3-D structural alignment the high sequence similarity region was not aligned. A small conformational difference between the two structures had caused one helix to be shifted in relation to the other, so SSAP aligned the two structures four residues further along the helix, an example of conflicting sequence and structural alignments.

This reinforces the fact that measuring alignment quality using structural alignments (or some other structure-based measure) may not always be ideal when dealing with very remote homologues. However, no method is a perfect measure of alignment accuracy; estimates of alignment accuracy are always method-dependent, and differences become especially noticeable when the alignments are between remote homologues. While SSAP structural alignments were a good measure of alignment quality in most cases, they were less useful for measuring alignment accuracy for those remotely related pairs whose structures had the same topology, but had evolved to such an extent that twists, bulges or other conformational changes meant that they no longer superposed easily. Ideally, what is needed is a benchmark set of structural alignments agreed upon by a consensus of structural prediction methods and measures of model accuracy.

A consensus method as a comparison

We also investigated a second potential method of predicting alignment reliability. Residues were predicted as reliably aligned in those cases where all five methods agreed on the aligned target residue. This method had one big drawback in that it would only work for those pairs where every method produced an alignment, a total of just 152 pairs. The results from the consensus of the five methods were very similar to the results from the profile-derived alignment score evaluation of SAM T99. However, initial results suggest that the predicted reliable regions were not exactly the same. In general the consensus of the five methods predicted a higher percentage of reliably aligned residues for those pairs that had better alignments, while it tended to do less well than the profile-derived alignment score for those pairs that had fewer reliably aligned regions.
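The consensus rule described above can be sketched in a few lines. The representation (one dictionary per method, mapping template residue index to target residue index) and the function name are assumptions made for illustration only.

```python
from typing import Dict, List, Set


def consensus_reliable(alignments: List[Dict[int, int]]) -> Set[int]:
    """Template positions at which every alignment pairs the same target residue.

    Each alignment maps template residue index -> target residue index;
    a position missing from any alignment (a gap) can never be in the consensus.
    """
    if not alignments:
        return set()
    first, rest = alignments[0], alignments[1:]
    return {
        template_pos
        for template_pos, target_pos in first.items()
        if all(other.get(template_pos) == target_pos for other in rest)
    }
```

Because a position must be present in every alignment, the rule can only be applied to pairs for which all methods returned an alignment, which is the drawback noted in the text.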

Discussion

There are certain similarities between this approach and the measurement of alignment reliability based on data from sub-optimal alignments.30 In one method, aligned positions are seen as reliable where aligned residues are conserved over many alignments of the same two sequences. In the other, aligned positions are regarded as reliable where aligned residues are conserved over a single multiple alignment of related sequences. This similarity is perhaps not so surprising. Regions of remotely related sequences that have high levels of sequence similarity will be conserved over many sub-optimal alignments of the same pair. These regions will also be conserved in a multiple alignment of a large number of sequences from the same family, as long as the regions of similarity have not arisen by chance and serve some structural or functional purpose within members of the family.

Once the PSI-BLAST profiles are generated for the template proteins, the calculation of reliable regions in alignments is remarkably simple and effective. Knowledge of which regions are likely to be correctly aligned and which regions may be misaligned will be especially useful as a guide in threading and modelling. If a region of an alignment or indeed a whole alignment is reliably aligned it should be possible to perform further work on the basis of the alignment, and misaligned regions can be realigned in order to improve the alignment. The prediction of reliable regions in alignments will also be helpful in the study of binding sites and in terms of predicting function for unknown sequences in a structural genomic context.

One area where the calculation of profile-derived alignment scores may be of some use is in automatic discrimination of good alignments from bad ones. Those alignments that had low total profile-derived alignment scores (derived from the sum of their individual scores) tended to be of poor quality and would generally be unsuited as the basis for models. One example would be the GenTHREADER alignment for the pair 1rsy-1djxB, where GenTHREADER actually aligns the target 1rsy with the wrong domain of the template, 1djxB. The poor alignment score (0.08) and the fact that it contains just one reliably aligned region would be enough to suggest that the alignment did not merit further study.

As well as being able to predict reliable regions in alignments, the method can also help pinpoint likely binding sites in the query sequence. The method is particularly effective at predicting alignment reliability for binding site residues, especially when the aligned target and template residues are identical amino acid residues. However, prediction of binding sites using profile-derived alignment scores would need to be a multi-step process, since changes in specificity that have evolved to accommodate new substrates or subtle changes in function are often accompanied by changes to the residues in the active site. The more remote the relationship the greater the changes,31,32 so it cannot be assumed that just because the template structure has a binding site at one position in the template sequence there is an equivalent binding site in the target sequence.


The first step would be to locate the binding sites in the template structure. The second step would be to use the profile-derived alignment score to pinpoint regions of conserved sequence. If a residue that is a known binding site falls within the sequence conserved region and the residue that it aligns with is identical (or has similar characteristics), the aligned target residue is almost certainly correctly aligned and is probably a binding site too. Binding sites that were aligned with the same amino acid residue and that agreed with the SSAP alignment tended to be in highly conserved, high scoring regions (more than 80% in our results). This seems to be a precondition of a correctly aligned binding site, that it is in a sequence conserved region. The third step would be to verify whether the binding sites are predicted correctly by checking the whole alignment, the resulting model and the literature.

There is certainly room for refining the scoring scheme used in this work. For example the scoring of gaps is fairly crude and the cut-offs used to calculate the reliable regions are a trade-off. The accuracy of the prediction of reliable regions can be improved by altering the cut-offs, but only at the expense of the percentage of reliable regions predicted. The scoring scheme might also be improved by introducing prior knowledge of secondary structure (the prediction of strand residues in particular seems to be a lot more reliable than the prediction of loop regions) or by refining the calculation method. The scoring scheme should also work if profiles from other profile-based methods, such as hidden Markov models, are used to score alignments. Indeed results with the alignments returned by the SAM server suggest that carefully curated hidden Markov models may even perform better than PSI-BLAST profiles in this regard.

Of course, errors in the profiles used to score the alignments are bound to occur. Most errors will happen on a small scale and their effects will not be obvious, but if the errors in the alignments in the PSI-BLAST profile are serious, the profile will generally lose some of its potential to recognise reliably aligned regions. On those occasions where the profile is badly misaligned and the alignment being evaluated is based on information from the same profile, as was the case with some IMPALA and GenTHREADER alignments, the number of false positive predictions for reliable regions may increase. However, it is perfectly possible to base the evaluation of reliable regions on a profile of a template sequence where the alignments are generated from a profile of the equivalent query sequence, as the results with PSI-BLAST demonstrated.

There are a number of structures that at present have few or no relatives in the sequence databases. Since this method calculates scores based on profiles built from multiple alignments, it will clearly not be applicable where the profiles contain few or no sequences. However, this may improve as the sequence and structural databases grow.

Although it seems that profile-derived alignment scores alone cannot be used to predict alignment quality directly, they are certainly a useful guide to the relative quality of an alignment. So as well as opening the door to further work based on the prediction of reliable regions and binding sites, profile-derived alignment scores might also be used, in conjunction with other measures, in fold prediction.
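The first two steps of the binding-site procedure outlined in this Discussion lend themselves to a small filter, sketched here under stated assumptions: binding sites are given as 0-based template indices, reliable regions as inclusive (start, end) index pairs, and only identical residues are accepted (the "similar characteristics" case is left out). The function and parameter names are hypothetical. The third step (checking the whole alignment, the model and the literature) remains manual.

```python
from typing import List, Optional, Sequence, Tuple


def candidate_binding_sites(
    template_sites: Sequence[int],
    reliable_regions: Sequence[Tuple[int, int]],
    template_seq: str,
    aligned_target: Sequence[Optional[str]],
) -> List[int]:
    """Template binding-site positions worth carrying over to the target.

    aligned_target[i] is the target residue aligned to template position i,
    or None where the template residue is aligned to a gap.
    """
    candidates = []
    for pos in template_sites:
        # Step 2a: the site must lie inside a predicted reliable region.
        in_region = any(start <= pos <= end for start, end in reliable_regions)
        # Step 2b: the aligned target residue must be the same amino acid.
        target_res = aligned_target[pos]
        identical = target_res is not None and target_res == template_seq[pos]
        if in_region and identical:
            candidates.append(pos)
    return candidates
```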

Materials and Methods

The benchmark set of structurally related pairs

The set of remotely homologous pairs used in the study was created from the April 1999 version of the FSSP database.33 The FSSP database is a set of structurally characterised proteins organised into unique families of evolutionarily similar structures. The April 1999 version contained 1621 unique families, each with a representative protein chain and a list of the chains that were found to belong to the same structural family by the structural comparison program DALI.34 Pairs that had a high structural similarity but low sequence similarity were selected from the FSSP database, with a maximum of one pair per FSSP family to avoid unbalancing the data set. The pairs of proteins in the benchmark set had structural Z-scores ranging from 10 to 44.6, so DALI recognises a strong structural similarity. However, as a precaution, the pairs were also checked against the SCOP35 database for confirmation that they belonged to the same structural superfamily. Hadley & Jones36 showed that structural databases sometimes do not agree with each other, even when dealing with pairs of proteins that are supposed to be strongly structurally similar. In addition, the 413 pairs that made up the set had low sequence similarity. The percentage identity of the pairs ranged from 10% to 41%, but the majority of the pairs had percentage identities of between 10% and 25%, so that their alignment would present certain difficulties. In each of the 413 pairs of structurally characterised proteins, one protein (the family representative) was designated as the “target” or query protein, while the other played the role of structural “template”.

Creating the alignments for accuracy prediction

A range of techniques were chosen to create alignments for each of the pairs in the benchmark set: the multiple alignment program CLUSTALW, the fold recognition programs GenTHREADER and 3DPSSM, and the sequence profile-based methods SAM T99 and IMPALA. IMPALA, GenTHREADER and CLUSTALW alignments were generated locally; those for 3DPSSM and SAM T99 were generated by submitting pairs of sequences to the servers.

3DPSSM alignments

3DPSSM is a protein fold recognition program that combines both 1-D and 3-D sequence profiles with secondary structural information and solvation potentials similar to those generated in GenTHREADER. Query sequences are threaded onto a library of known protein structures and the resulting sequence-structure alignments are scored using the 1-D and 3-D-PSSMs that are generated for the target and template sequences, the matching of template and target predicted secondary structure elements, and the solvation potentials. Each pair was submitted to the 3DPSSM server and the default parameters were used. A total of 219 alignments returned by the server were compared in the study.

CLUSTALW alignment generation

The CLUSTALW multiple alignment program has three stages. First all the collected query sequences are aligned in pairs and a “distance matrix” is calculated from the divergence between each pair of sequences. A tree based on the divergence between the pairs is computed from the distance matrix and the sequences are then aligned in the order described by the branches of the tree. Sequences were collected from the HSSP database37 for each of the targets; the FSSP template sequence was then added to the collected sequences and CLUSTALW, version 1.82, was run with the default settings. CLUSTALW produced alignments for 397 of the pairs.

GenTHREADER alignment generation

GenTHREADER uses a combination of sequence profiles, a neural network and pseudo-energy scores (pair and solvation potentials) to predict evolutionary relationships with known structures. It calculates alignments by “threading” target sequences onto a library of folds and using a neural network to score the threaded structures. GenTHREADER was run locally for all 413 pairs in the benchmark set. GenTHREADER generates sequence profiles for each of the targets with PSI-BLAST (ten iterations at a conservative cut-off E-value of 0.001 with a local non-redundant database). GenTHREADER produced 333 alignments, all of which were compared in the study.

IMPALA alignment generation

IMPALA is a profile-based alignment program. It compares single query sequences against a database of PSI-BLAST profiles pre-generated for all the template sequences in a structural database. Because it searches a reduced database it can use the more rigorous Smith–Waterman algorithm to generate alignments between the query and template sequences. PSI-BLAST profiles were generated for the 1621 representative sequences in the April 1999 version of the FSSP database. PSI-BLAST was run for four iterations with a liberal cut-off E-value of 0.01 on the May 2000 version of the NRDB90 non-redundant database.38 The database of template profiles was then searched with the query sequence from each pair. Only those pairs that had alignments with E-values of less than 0.001 were included, leaving a total of 319 IMPALA alignments from the benchmark set in the study.

PSI-BLAST alignment generation

PSI-BLAST is more or less the reverse of IMPALA in that the sequence profile is built around the target sequence instead of the template sequence. PSI-BLAST was run against the May 2000 version of the NRDB90 non-redundant database for six iterations and with a liberal cut-off E-value of 0.01. Only those pairs that had alignments with E-values of less than 0.001 were included, a total of 323 alignments.

SAM T99 alignments

SAM T99 is another sequence profile-based method, but it uses hidden Markov models instead of position-specific score matrices. Hidden Markov models are built for a library of templates of known structure by iteratively searching a non-redundant sequence database and aligning those sequences found below a certain threshold. The query sequence can then search this profile library. All the pairs in the benchmark set were also submitted to the SAM T99 server. A total of 269 alignments returned by the server were evaluated.

Alignment comparison

Though it would have been possible to make a direct comparison of the alignments produced by each method by including in the evaluation only those pairs for which every method produced an alignment, we felt that a comparison between alignment methods requires a specialised paper, such as those papers already referenced in this study, in which a full range of parameters are used to explore alignment generation.

Secondary structure regions

The designations of secondary structure that were used in the comparison were derived for each template from the DSSP files where possible. For the purposes of the calculations and for simplification, each residue in every template sequence was designated one of three secondary structure categories: helix, strand or loop.

Binding sites

Residues involved in binding sites were derived from the RCSB Protein Data Bank (PDB). Binding sites in the PDB are annotated by crystallographers when they are apparent. The definition of a binding site is determined by human experts, but it is not homogeneous. The number of binding sites per structure varies considerably and many PDB files do not contain any binding site information. For each of the template structures all those residues recognised as binding sites in the template PDB header files were included. 115 of the 413 templates in the benchmark set had binding sites defined in their PDB header file, with between 1 and 53 sites per structure. The 115 pairs contained 1159 binding sites in total.

Creating the profiles for scoring the alignments

PSI-BLAST sequence profiles were generated for each of the template sequences in each of the 413 pairs. The profiles were generated by running PSI-BLAST for four passes over the local non-redundant database with a generous E-value iteration cut-off of 0.01. The PSI-BLAST profiles were converted into numerical form with the first step of the IMPALA process. These matrices were used to score each position in the alignments generated by each of the five test methods.


Calculating the profile-derived alignment scores

Every residue in each template PSI-BLAST profile matrix has a set of 20 associated scores, one for each amino acid. The scores are a measure of the probability of each amino acid aligning at each residue. So for each aligned target residue in an alignment a score can be read from the PSI-BLAST profile matrix, resulting in a string of alignment scores for each alignment. Alignment scores were generated in this fashion from the template profile matrices for every alignment produced by each of the five test methods. These scores, plotted against increasing residue number starting from the N-terminal end of the template sequence, generally produced a jagged series of peaks and troughs such as the example shown in Figure 9(a). However, these peaks and troughs can be smoothed by combining the alignment scores from adjacent template residues. To calculate a smoothed profile-derived alignment score for each template residue in an alignment a triangular smoothing window of five residues was used. Gaps were penalised by a single penalty of -200 for each residue. The simple formula for calculating the smoothed residue scores was as follows:

\[
\mathrm{Score} = S_{a}(\mathrm{res}-2) + 2 \times S_{a}(\mathrm{res}-1) + 3 \times S_{a}(\mathrm{res}) + 2 \times S_{a}(\mathrm{res}+1) + S_{a}(\mathrm{res}+2)
\]

where S is the PSI-BLAST profile matrix score for amino acid a at residue res. When the smoothed profile-derived alignment scores are plotted against increasing residue number the jagged peaks and troughs of the individual scores become much smoother (Figure 9(b)).

Prediction of reliably aligned sections

Reliably aligned regions were determined from the smoothed profile-derived alignment scores. Regions considered to be reliably aligned were continuous strings of residues that had a peak of profile-derived alignment score of greater than 4.0 for at least two residues and tails where the score did not fall below 2.0 (see Figure 9(c)).
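The scoring and region-detection steps described above can be sketched as follows. This is a minimal illustration rather than the authors' code, and it rests on several assumptions: the profile is held as one dictionary of amino acid scores per template residue, window positions that fall outside the sequence contribute nothing, the weighted sum is used exactly as printed with no further normalisation, and the "peak ... for at least two residues" condition is read as two adjacent residues above the peak cut-off. All names are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

GAP_PENALTY = -200.0        # per-residue gap penalty quoted in the text
WEIGHTS = (1, 2, 3, 2, 1)   # triangular five-residue window
PEAK_CUTOFF = 4.0           # a region needs a peak above this value
TAIL_CUTOFF = 2.0           # a region ends where the score falls below this


def raw_scores(profile: Sequence[Dict[str, float]],
               aligned_target: Sequence[Optional[str]]) -> List[float]:
    """One score per template residue: the profile score of the aligned
    target amino acid, or the gap penalty where the target has a gap (None)."""
    return [GAP_PENALTY if aa is None else profile[i][aa]
            for i, aa in enumerate(aligned_target)]


def smooth(scores: Sequence[float]) -> List[float]:
    """Weighted 1-2-3-2-1 sum over a five-residue window.
    Assumption: window positions beyond the chain ends are skipped."""
    n = len(scores)
    smoothed = []
    for i in range(n):
        total = 0.0
        for offset, weight in zip(range(-2, 3), WEIGHTS):
            j = i + offset
            if 0 <= j < n:
                total += weight * scores[j]
        smoothed.append(total)
    return smoothed


def reliable_regions(smoothed: Sequence[float],
                     peak: float = PEAK_CUTOFF,
                     tail: float = TAIL_CUTOFF) -> List[Tuple[int, int]]:
    """Inclusive (start, end) index pairs of runs that stay at or above
    `tail` and contain two adjacent residues scoring above `peak`."""
    regions = []
    i, n = 0, len(smoothed)
    while i < n:
        if smoothed[i] < tail:
            i += 1
            continue
        start = i
        while i < n and smoothed[i] >= tail:
            i += 1
        run = smoothed[start:i]
        if any(a > peak and b > peak for a, b in zip(run, run[1:])):
            regions.append((start, i - 1))
    return regions
```

Raising `peak` or `tail` trades coverage for accuracy, which mirrors the trade-off in the cut-offs noted in the Discussion.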
Measuring alignment accuracy against structural alignments

Measuring alignment accuracy is not a trivial task. There is no such thing as a “true” alignment except where sequences are highly conserved. One solution is to use structural alignments as the standard to evaluate sequence alignments. Alignments based on structural features will be more reliable than those based on sequence alone, since structure is seen to be much more highly conserved than sequence.39 A good structural alignment ought, therefore, to be the best approximation of the “correct” biological alignment when dealing with remotely homologous sequences. However, structural alignments, while more reliable than sequence alignments, can be method dependent. Different methods can produce differences in alignment for the same pair of structures, especially if the two structures are only remotely related.40 Sauder et al.4 compared alignments produced by the structural alignment algorithm CE41 with alignments produced by DALI and found that at sequence identities between 10% and 15% they were only identically aligned at 75% of the residues. SSAP structural alignments were chosen to represent the “correct” alignments in this study in order to get around this problem. Since SSAP alignments include an indication of the strength of the structural alignment at each aligned pair of residues, it was possible to evaluate the accuracy of the alignments generated by the test methods against just those residues that SSAP aligned with confidence. Some 16.9% of the aligned residues generated by SSAP for the benchmark set had no strength symbol, so the test method alignments were evaluated against just 83.1% of the structural alignments.

Figure 9. Profile and alignment scores for the 3DPSSM alignment of 1aac and 1plc. (a) The raw PSI-BLAST profile score plotted against residue number for each aligned residue, starting from the N-terminal end of the alignment; (b) profile-derived alignment score smoothed over a five residue window plotted against residue number; (c) profile-derived alignment score plotted against residue number, those regions of the alignment predicted as reliable are shown in black.

The SSAP structural alignments

Structural alignments were generated with the structural comparison program SSAP for each of the pairs in the benchmark set. SSAP compares protein structures in 3-D space by looking for matching structural environments and returns a score between 0 and 100 for each structural alignment. This indicates the degree of structural similarity, where 0 is no structural similarity and 100 indicates complete structural agreement. A score greater than 70 is supposed to show that the two structures have the same topology and the higher the reliability score the better the quality of the alignment. The SSAP structural alignment reliability score was greater than 70 for all the pairs in the benchmark set. SSAP structural alignments include an indication of the confidence of the structural alignment at each residue. The “strength” of the alignment between each pair of aligned residues in an SSAP alignment is indicated by a series of symbols (see Figure 10). The strength of each residue pairing is based on the comparison of the structural environments of the two residues and is reported as a series of symbols and spaces in the central line of the alignment. The @ symbol denotes the highest scoring pairings, while · is the lowest scoring symbol. The aligned residues without any symbol in the strength line are those that are aligned with the least confidence. As previously mentioned, alignment reliability was only evaluated against those positions in the SSAP alignment with a symbol in the strength line (the strongly aligned SSAP residues).

Figure 10. SSAP structural alignment of 1wdcB (bold, top) and 1wdcC (lower bold). The first and fifth lines (in italics) indicate the structural characterisation of the two sequences, the central line is the strength line, an indication of the strength of the structural alignment between two residues.

Acknowledgements

The authors thank Richard Mott for the initial suggestion and also Sharon Berryman, Federico Abascal and Manuel Gomez, without whom none of this work would have been possible.

References

1. Tramontano, A., Leplae, R. & Morea, R. (2001). Analysis and assessment of comparative modeling predictions in CASP4. Proteins: Struct. Funct. Genet. Suppl. 5, 22–38.
2. Rost, B. (1999). Twilight zone of protein sequence alignments. Protein Eng. 12, 85–94.
3. Domingues, F. S., Lackner, P., Andreeva, A. & Sippl, M. J. (2000). Structure based evaluation of sequence comparison and fold recognition alignment accuracy. J. Mol. Biol. 297, 1003–1013.
4. Sauder, J. M., Arthur, J. W. & Dunbrack, R. L. (2000). Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins: Struct. Funct. Genet. 40, 6–22.
5. Blake, J. D. & Cohen, F. E. (2001). Pairwise sequence alignment below the twilight zone. J. Mol. Biol. 307, 721–735.
6. Jaroszewski, L., Rychlewski, L. & Godzik, A. (2000). Improving the quality of twilight-zone alignments. Protein Sci. 9, 1487–1496.
7. Altschul, S. R., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res. 25, 3389–3402.
8. Elofsson, A. (2002). A study on how to best align protein sequences. Proteins: Struct. Funct. Genet. 46, 300–309.
9. Venclovas, C. (2001). Comparative modeling of CASP4 target proteins: combining results of sequence search with three-dimensional structure assessment. Proteins: Struct. Funct. Genet. Suppl. 5, 47–54.
10. Vingron, M. & Argos, P. (1990). Determination of reliable regions in protein sequence alignments. Protein Eng. 3, 565–569.
11. Chao, K.-M., Hardison, R. C. & Miller, W. (1993). Locating well-conserved regions within a pairwise alignment. Comput. Appl. Biosci. 9, 387–396.
12. Mevissen, H. T. & Vingron, M. (1996). Quantifying the local reliability of a sequence alignment. Protein Eng. 9, 127–132.
13. Zhang, Z., Berman, P., Wiehe, T. & Miller, W. (1999). Post-processing long pairwise alignments. Bioinformatics, 15, 1012–1019.
14. Jaroszewski, L., Li, W. & Godzik, A. (2002). In search for more accurate alignments in the twilight zone. Protein Sci. 11, 1702–1713.
15. Cline, M., Hughey, R. & Karplus, K. (2002). Predicting reliable regions in protein sequence alignments. Bioinformatics, 18, 306–314.
16. Schlosshauer, M. & Ohlsson, M. (2002). A novel approach to local reliability of sequence alignments. Bioinformatics, 18, 847–854.
17. Needleman, S. B. & Wunsch, C. D. (1970). An efficient method applicable to the search for similarities in the amino acid sequences of two proteins. J. Mol. Biol. 48, 443–453.
18. Kelley, L. A., MacCallum, R. M. & Sternberg, M. J. E. (2000). Enhanced genome annotation using structural profiles in the program 3D-PSSM. J. Mol. Biol. 299, 499–520.
19. Thompson, J. D., Higgins, D. G. & Gibson, T. J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucl. Acids Res. 22, 4673–4680.
20. Jones, D. T. (1999). GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J. Mol. Biol. 287, 797–815.
21. Schaeffer, A. A., Wolf, Y. I., Ponting, C. P., Koonin, E. V., Aravind, L. & Altschul, S. F. (1999). IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics, 15, 1000–1011.
22. Karplus, K., Barrett, C. & Hughey, R. (1998). Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14, 846–856.
23. Casari, G., Sander, C. & Valencia, A. (1995). A method to predict functional residues in proteins. Nature Struct. Biol. 2, 171–178.
24. Lichtarge, O. & Sowa, M. E. (2002). Evolutionary predictions of binding surfaces and interactions. Curr. Opin. Struct. Biol. 12, 21–27.
25. Taylor, W. R. & Orengo, C. A. (1989). A holistic approach to protein structure alignment. J. Mol. Biol. 208, 1–22.
26. Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577–2637.
27. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, F., Bhat, T. N., Weissig, H. et al. (2000). The Protein Data Bank. Nucl. Acids Res. 28, 235–242.
28. Jörnvall, H., Persson, B., Krook, M., Atrian, S., Gonzalez-Duarte, R., Jeffery, J. & Ghosh, D. (1995). Short-chain dehydrogenases/reductases (SDR). Biochemistry, 34, 6003–6013.
29. Filling, C., Berndt, K. D., Benach, J., Knapp, S., Prozorovski, T., Nordling, E. et al. (2002). Critical residues for structure and catalysis in short-chain dehydrogenases/reductases. J. Biol. Chem. 277, 25677–25684.
30. Vingron, M. (1996). Near-optimal alignment. Curr. Opin. Struct. Biol. 6, 346–352.
31. Todd, A. E., Orengo, C. A. & Thornton, J. M. (2002). Plasticity of enzyme active sites. Trends Biochem. Sci. 27, 419–426.
32. Devos, D. & Valencia, A. (2000). Practical limits of function prediction. Proteins: Struct. Funct. Genet. 41, 98–107.
33. Holm, L. & Sander, C. (1996). Mapping the protein universe. Science, 273, 595–602.
34. Holm, L. & Sander, C. (1998). Touring protein fold space with DALI/FSSP. Nucl. Acid Res. 26, 316–319.
35. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). SCOP: a structural classification of proteins database. J. Mol. Biol. 247, 436–540.
36. Hadley, C. & Jones, D. T. (1999). A systematic comparison of protein structure classifications: SCOP, CATH and FSSP. Structure, 7, 1099–1112.
37. Sander, C. & Schneider, R. (1991). Database of homology derived protein structures and the structural meaning of sequence alignment. Proteins: Struct. Funct. Genet. 9, 56–68.
38. Holm, L. & Sander, C. (1998). Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics, 14, 423–429.
39. Chothia, C. & Lesk, A. M. (1986). The relation between the divergence of sequence and structure in proteins. EMBO J. 5, 823–826.
40. Godzik, A. (1996). The structural alignment between two proteins: is there a unique answer? Protein Sci. 5, 1325–1338.
41. Shindyalov, I. N. & Bourne, P. E. (1998). Protein alignment by incremental combinatorial extension (CE) of the combinatorial path. Protein Eng. 11, 739–747.

Edited by J. Thornton (Received 3 December 2002; received in revised form 28 April 2003; accepted 9 May 2003)