doi:10.1016/S0022-2836(03)00318-8
J. Mol. Biol. (2003) 328, 567–579
Motif Refinement of the Peroxisomal Targeting Signal 1 and Evaluation of Taxon-specific Differences
Georg Neuberger1*, Sebastian Maurer-Stroh1, Birgit Eisenhaber1, Andreas Hartig2 and Frank Eisenhaber1*
1 Research Institute of Molecular Pathology, Dr. Bohrgasse 7, A-1030 Vienna, Austria
2 Institut für Biochemie und Molekulare Zellbiologie, Dr. Bohrgasse 9, A-1030 Vienna, Austria
Eukaryotic peroxisomes, plant glyoxysomes and trypanosomal glycosomes belong to the microbody family of organelles, which compartmentalise a variety of biochemical processes. The interaction between the PTS1 signal and its cognate receptor Pex5 initiates the major import mechanism for proteins into the matrix of these organelles. We derived the requirements for a C-terminal amino acid sequence to interact productively with Pex5 from three sources: the amino acid sequence variability of known PTS1-targeted proteins and of PTS1-containing peptides that interact with Pex5 in the yeast two-hybrid assay; binding site studies of the crystal structure of the Pex5–ligand complex; and 3D models and sequences of Pex5 proteins from various taxa. We found evidence that at least the 12 C-terminal residues of a given substrate protein are implicated in PTS1 signal recognition. This motif can be structurally and functionally divided into three regions: (i) the C-terminal tripeptide, (ii) a region interacting with the surface of Pex5 (about four residues further upstream), and (iii) a polar, solvent-accessible and unstructured region with linker function (the remaining five residues). Specificity differences are confined to taxonomic subgroups (metazoa and fungi) and are connected with amino acid type preferences in region 1 and deviating hydrophobicity patterns in region 2.
© 2003 Elsevier Science Ltd. All rights reserved
*Corresponding authors
Keywords: peroxisome; PTS1; subcellular localization; protein sequence motif
Abbreviation used: PTS1/PTS2, peroxisomal targeting signal 1/2.
Introduction
Peroxisomes are ubiquitous organelles of eukaryotic cells and belong to the microbody family of organelles together with plant glyoxysomes and glycosomes of trypanosomes.1 Microbodies are involved in a variety of biochemical processes, with enzyme spectra varying between microbody types.2 Two classes of targeting signals for peroxisomal matrix proteins have been characterised, termed PTS1 (ref. 3) and PTS2 (refs. 4,5), and the existence of additional ones has been proposed.6–8 Proteins carrying one of these signals are recognised in the cytosol by soluble PTS receptors, Pex5 for PTS1 proteins9,10 and Pex7 for PTS2 proteins.11–13 Whereas the PTS2 signal is assumed to consist of a nonapeptide located near the N terminus of a given target protein,14,15 the PTS1 sequence was characterised as a C-terminal consensus tripeptide more than a decade ago.16 The 3D structure of a PTS1-containing pentapeptide complexed to the human Pex5 receptor protein has been solved.17 Differences in the recognition of the PTS1 signal between taxonomic subgroups have been demonstrated.18 Furthermore, it has been shown that the C-terminal consensus tripeptide is far more variable than initially assumed. Even residues that do not fall within the tiny-positive-leucine consensus can yield functional PTS1 signals. In this context, upstream residues seem to influence the targeting efficiency of the C-terminal tripeptide.18 Since the variety of C-terminal tripeptides has become significantly larger due to new sequence examples, and upstream positions have been shown to play an integral role in substrate recognition, the description of the PTS1 signal as a C-terminal consensus tripeptide with specificity-modulating upstream residues is clearly insufficient. In this work, we revised the requirements
for a C-terminal amino acid sequence to function as a PTS1, using the available data on sequence variability of Pex5-recognised protein substrates as well as Pex5 sequences and structures.
Results and Discussion
Outline of the motif evaluation
As a preliminary step, we constructed a heterogeneous database composed of probable and experimentally verified PTS1 sequences from SWALL (ref. 19) (SW set, 205 sequences) and fusion constructs containing PTS1 C termini tested for interaction with Pex5 in the yeast two-hybrid system (LH set, 150 sequences).18 The learning set has a total size of 355 sequences, with 211 and 72 entries belonging to the metazoan and fungal taxonomic groups, respectively, in addition to 72 sequences from plants and protozoa. For the calculation of relative amino acid type occurrences at C-terminal sequence positions, balancing for uneven representation of protein families is necessary. As an internal control, this procedure was performed in two ways: either by using the PSIC method,20 or by calculating a largest subset of unrelated sequences.21,22 The learning set entries were analysed with regard to physico-chemical requirements over single and multiple positions using a database of almost 700 amino acid properties23,24 such as hydrophobicity, charge or flexibility. The results were interpreted using the 3D structure of a PTS1-containing pentapeptide complexed to its receptor molecule Pex5 in human,17 in addition to a model of the PTS1–yeast Pex5 complex. The learning set construction and analysis details are described in Methodological Details. Supplementary material to the obtained results is presented on the associated website.†
Motif length determination
As a first step towards the characterisation of the PTS1 import signal, its true motif length had to be evaluated. As a criterion, we searched for C-terminal sequence ranges that clearly deviate in their physical properties and amino acid type occurrences from average C termini in SWISSPROT 40. The initial motif length determination is all the more important because the LH set sequences are peptides of short and variable length with the same Gal4 sequence appended at each N terminus (see Methodological Details).
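The length criterion described above (looking for C-terminal positions whose average physical property deviates from that of ordinary C termini) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Kyte-Doolittle hydropathy values stand in for the AAindex property scales used in the paper, the function names `cterm_profile` and `deviation` are hypothetical, and the background mean and standard deviation are supplied by the caller rather than computed from SWISSPROT. Positions are counted from the C terminus, with 0 denoting the last residue, and sequences shorter than the requested length are skipped.

```python
from statistics import mean

# Kyte-Doolittle hydropathy, used here only as a stand-in property scale
# (assumption: the paper itself uses AAindex scales such as NAKH920106).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def cterm_profile(seqs, scale, length):
    """Mean property value at each C-terminal position.

    Returns {i: mean} for i = -(length - 1) .. 0, where i = 0 is the
    C-terminal residue. Sequences shorter than `length` are skipped.
    """
    usable = [s for s in seqs if len(s) >= length]
    if not usable:
        raise ValueError("no sequence is long enough")
    profile = {}
    for i in range(-(length - 1), 1):
        # Position i relative to the C terminus is string index len(s) - 1 + i.
        profile[i] = mean(scale[s[len(s) - 1 + i]] for s in usable)
    return profile


def deviation(profile, background_mean, background_sd):
    """Express each positional mean in units of background standard deviations."""
    return {i: (v - background_mean) / background_sd for i, v in profile.items()}


if __name__ == "__main__":
    toy = ["MKTAYSKL", "GGDEAAKL", "SKL"]  # the 3-mer is skipped for length 5
    prof = cterm_profile(toy, KD, 5)
    # Hypothetical background values, chosen only to make the example run.
    print(deviation(prof, background_mean=-0.5, background_sd=3.0))
```

Scanning such a profile from position 0 upstream and noting where the deviation settles near zero gives a rough reading of where a motif ends; the toy sequences and background values here are invented for illustration only.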
† http://mendel.imp.univie.ac.at/PTS1/
Figure 1. Deviation of selected physical properties from the SWISSPROT average.
Because the LH peptides carry the Gal4 sequence at their N termini, it is necessary to exclude any peptide shorter than the motif length in order to avoid a compositional bias caused by included Gal4 sequence stretches. Motif-specific deviations from average C termini over many sequence positions were obtained by analysing amino acid type occurrence correlations, for example, with the properties NAKH920106 (AA composition of the cytosolic parts of multi-spanning proteins)25 on the SW learning set, and ISOY800101 (normalised relative frequency of α-helix)26 on all sequences from the LH set having full-length (16 residues) C-terminal random peptides (Figure 1). For the description of C-terminal sequence positions, here and throughout the text we use a numbering i relative to the total sequence length n. Thus, n + i is the actual position of the residue in the sequence; for example, i = 0 is the C-terminal residue and i = −2 is the most N-terminal residue of the C-terminal tripeptide. The property scale NAKH920106 correlates with the average amino acid composition of proteins but tends towards higher values for many small and/or hydrophilic residues. Therefore, plotting the property NAKH920106 as a function of sequence position (Figure 1(a)) measures the preference for hydrophilic or hydrophobic residues in that region. For the SW learning set, the general appearance of the plot indicates a tendency for hydrophilic residues upstream until around position −11. Not surprisingly, this result cannot
Table 1. Representative selection of significant correlation coefficients for single positions

Row   Position   Property     r       Description                                                        Ref.

Metazoan sequences from the SW learning set (96 entries)
M1    −3         RACS770103   0.61    Side-chain orientational preference                                48
M2    −4         NAKH900109   0.63    AA composition of the membrane proteins                            49
M3    −4         ZVEL_ALI_2   0.70    Aliphatic residues                                                 50
M4    −4         ONE_AA__A_   0.82    Single amino acid, alanine
M5    −5         NAKH920104   0.67    AA composition of the exterior part of single-spanning proteins    25
M6    −6         KARP850103   0.65    Flexibility parameter for two rigid neighbours                     51
M7    −6         ONE_AA__K_   0.78    Single amino acid, lysine
M8    −7         NAKH920101   0.65    AA composition of the cytosolic part of single-spanning proteins   25
M9    −8         NAKH920106   0.62    AA composition of the cytosolic parts of multi-spanning proteins   25
M10   −9         NAKH920104   0.77    AA composition of the exterior part of single-spanning proteins    25
M11   −10        KARP850101   0.79    Flexibility parameter for no rigid neighbour                       51
M12   −10        EISD860101   −0.73   Solvation free energy                                              32

Non-consensus metazoan sequences from the LH learning set (50 entries; ref. 18)
M13   −3         ZVEL_CH_P1   0.74    Positive charge, without histidine                                 50
M14   −3         ONE_AA__R_   0.92    Single amino acid, arginine
M15   −4         NAKH920105   0.73    AA composition of the membrane parts of single-spanning proteins   25
M16   −4         ZVEL_ALI_1   0.68    Aliphatic residues                                                 50
M17   −4         ONE_AA__L_   0.86    Single amino acid, leucine
M18   −6         LEVM760104   0.68    Side-chain torsion angle φ                                         52

Fungal sequences from the SW learning set (37 entries)
F1    −3         ZVEL_CH_P1   0.62    Positive charge, without histidine                                 50
F2    −3         ONE_AA__K_   0.77    Single amino acid, lysine
F3    −3         KARP850103   0.73    Flexibility parameter for two rigid neighbours                     51
F4    −4         NAKH920103   0.65    AA composition of the exterior part of single-spanning proteins    25
F5    −5         ARGP820101   −0.75   Hydrophobicity index                                               53
F6    −6         NAKH920101   0.74    AA composition of the cytosolic part of single-spanning proteins   25
F7    −7         NAKH920106   0.61    AA composition of the cytosolic parts of multi-spanning proteins   25
F8    −8         KARP850101   0.63    Flexibility parameter for no rigid neighbour                       51
F9    −8         NAKH920103   0.61    AA composition of the exterior part of single-spanning proteins    25
F10   −9         NAKH920103   0.63    AA composition of the exterior part of single-spanning proteins    25
F11   −10        KARP850103   0.68    Flexibility parameter for no rigid neighbour                       51

Non-consensus fungal sequences from the LH learning set (20 entries; ref. 18)
F12   −3         ZVEL_CH_P1   0.71    Positive charge, without histidine                                 50
F13   −3         ONE_AA__K_   0.79    Single amino acid, lysine
F14   −4         ONE_AA__R_   0.97    Single amino acid, arginine
F15   −5         ZIMJ680102   −0.66   Bulkiness                                                          54
F16   −6         EISD860101   −0.64   Solvation free energy                                              32

All sequences from the SW learning set (205 entries)
A1    −7         NAKH920101   0.73    AA composition of the cytosolic part of single-spanning proteins   25
A2    −9         NAKH920107   0.71    AA composition of the exterior parts of multi-spanning proteins    25
A3    −9         FAUJ880103   −0.67   Normalised Van der Waals volume                                    55
A4    −10        NAKH920103   0.69    AA composition of the exterior part of single-spanning proteins    25
A5    −11        NAKH920106   0.67    AA composition of the cytosolic parts of multi-spanning proteins   25

be fully reproduced for the LH learning set sequences since their C-terminal segments are apparently not a part of the Gal4 structure and can be assumed accessible anyhow. The mean hydrophobicity level generally fluctuates closely around the SWISSPROT average from position −5 on (data not shown). However, the tendency to
form α-helices seems to be reduced in the region ranging from position −4 to positions −12 or −14 (Figure 1(b)). Hence, the upstream region preferably lacks intrinsic structural preference and can adopt an extended, more flexible conformation. As a consequence, we assume that the upstream boundary of the motif lies between positions −11
Table 2. Representative selection of significant F-values for combinations of two and three positions

Row   Positions     Property     F-value   Significance (%)   Description                                                          Ref.

Metazoan sequences from the SW learning set (largest subset of 36 entries)
M1    −6/−4         EISD840101   2.07      98.3               Consensus normalised hydrophobicity scale                            56
M2    −6/−4         GRAR740102   2.06      98.2               Polarity                                                             57
M3    −6/−4         VINM940101   1.82      96.0               Normalised flexibility parameters (B-values), average                58
M4    −5/−4/−2      CIDH920105   2.03      98.0               Normalised average hydrophobicity scales                             59
M5    −5/−4/−2      VINM940102   1.92      97.1               Normalised flexibility parameters (B-values), no rigid neighbour     58
M6    −6/−4/−1      GRAR740102   1.88      96.7               Polarity                                                             57
M7    −6/−4/−2      EISD860101   1.86      96.5               Solvation free energy                                                32
M8    −6/−5/−4      PONP800101   1.80      95.7               Surrounding hydrophobicity in folded form                            60
M9    −6/−5/−4      BIOV880101   1.79      95.5               Information value for accessibility; average fraction 35%            61
M10   −8/−5/−4      GN__CHARGE   2.08      98.3               Net charge

Non-consensus metazoan sequences from the LH learning set (50 entries; ref. 18)
M11   −3/−1         EISD860101   1.72      97.0               Solvation free energy                                                32
M12   −3/−1         TANS770106   1.64      95.7               Frequency of chain reversal D                                        62
M13   −3/−2/−1      EISD860101   1.95      98.9               Solvation free energy                                                32
M14   −3/−2/−1      TANS770106   1.90      98.7               Frequency of chain reversal D                                        62
M15   −5/−4/−2      VINM940103   1.82      98.1               Normalised flexibility parameters (B-values), one rigid neighbour    58
M16   −6/−3/−1      EISD860101   1.64      95.7               Solvation free energy                                                32
M17   −6/−5/−3      BHAR880101   1.64      95.7               Average flexibility indices                                          63
M18   −6/−5/−4      BIOV880102   1.72      97.0               Information value for accessibility; average fraction 23%            61
M19   −6/−5/−4      CIDH920104   1.65      95.9               Normalised hydrophobicity scales for α/β-proteins                    59

Fungal sequences from the SW learning set (largest subset of 30 entries)
F1    −6/−3         COHE430101   2.05      97.1               Partial specific volume                                              64
F2    −6/−5/−4      CIDH920101   1.99      96.6               Normalised hydrophobicity scales for α/β-proteins                    59
F3    −6/−5/−4      VINM940101   1.87      95.1               Normalised flexibility parameters (B-values), average                58
F4    −6/−4/−3      SWER830101   1.92      95.8               Optimal matching hydrophobicity                                      31
F5    −6/−4/−1      VINM940104   2.03      96.9               Normalised flexibility parameters (B-values), two rigid neighbours   58

Non-consensus fungal sequences from the LH set (20 entries; ref. 18)
F6    −3/−2/−1      ARGP820103   2.17      95.0               Membrane-buried preference parameters                                53
F7    −6/−3         GRAR740103   2.61      97.9               Volume                                                               57
F8    −4/−3/−2      ARGP820103   3.55      99.6               Membrane-buried preference parameters                                53
F9    −6/−4/−2      KARP850103   2.26      95.8               Flexibility parameter for two rigid neighbours                       51
F10   −6/−4/−2      ZVEL_HYDP2   2.45      97.1               Hydrophobic amino acids                                              50
F11   −6/−5/−3      NAKH920102   2.24      95.7               AA composition of the exterior part of single-spanning proteins      25
F12   −5/−2/−1      LEVM760101   2.45      97.1               Hydrophobic parameter                                                52
F13   −6/−4/−3      GN__CHARGE   3.59      99.6               Net charge
F14   −8/−6/−3      GN__CHARGE   2.34      96.4               Net charge

All taxa, sequences obtained from the SWALL database (largest subset of 81 entries)
A1    −11/−9/−8     GRAR740102   1.53      97.1               Polarity                                                             57
A2    −11/−9/−6     VINM940102   1.49      96.2               Normalised flexibility parameters, no rigid neighbour                58
A3    −10/−8/−6     EISD860101   1.59      98.0               Solvation free energy                                                32
A4    −10/−8/−6     VINM940102   1.49      96.2               Normalised flexibility parameters, no rigid neighbour                58
A5    −9/−8/−7      CIDH920104   1.48      95.9               Normalised hydrophobicity scales for α/β-proteins                    59
A6    −9/−8/−5      EISD860101   1.49      96.2               Solvation free energy                                                32
A7    −8/−6/−4      VINM940101   1.47      95.7               Normalised flexibility parameters, average                           58
A8    −8/−6/−3      ARGP820101   1.59      98.0               Hydrophobicity index                                                 53
and −14. For naturally occurring proteins, positions −12 to −14 appear not to favour hydrophilic residues (data not shown), making it less likely that the true PTS1 signal reaches further than position −11. The occurrence of a more hydrophobic sequence stretch indicates the beginning of the globular structure of the protein. Thus, measurable restrictions in amino acid type variability reach upstream at least to position −11. Single- and multiple-position correlations using hydrophobicity-related properties can be observed for both SW and LH sets in this region (Tables 1 and 2). Since the deviation from average C termini becomes small further N-terminally, we take this position as the motif boundary, resulting in a PTS1 motif consisting of 12 residues. However, we cannot completely exclude a minor influence of positions even further upstream. The 3D structure of a PTS1 pentapeptide complexed to human Pex5 (ref. 17) supports this hypothesis. If we assume that the PTS1 binding site on Pex5 is formed by a monomeric receptor molecule, a motif length of 12 residues indeed makes sense. The upstream-most residue (position −4) of the PTS1 pentapeptide in the 3D structure is still in the vicinity of the Pex5 receptor molecule. Supposing that a few adjacent residues still interact with the Pex5 surface, and that several amino acids
further upstream serve to separate the ligand core from the C-terminal signal, the obtained number of critical residues falls within a reasonable order of magnitude and is in agreement with our experience from other terminal signals.23,27 The occurrence of distinct regions within the motif can also be visualised using a sequence logo (Figure 2). Amino acid type preferences appear very pronounced within the three C-terminal positions, and the importance of the first and second amino acids within this tripeptide seems to differ between metazoa and fungi. The two to three adjacent positions are less specific, but amino acid preferences are still detectable. The sequence stretch further upstream is not characterised by a preference for very few amino acid types. The regional differences will be further elaborated below.
The C-terminal tripeptide and its correlation with position −3
For positions −2, −1 and 0, we found the consensus “tiny-positive-leucine” to be prevalent in the SW set, as well as taxon-specific tolerances for methionine in metazoa and phenylalanine in fungi as substitutions of the ultimate leucine (see also Figure 2). We analysed the residues in spatial vicinity of the PTS1 C-terminal leucine in the Homo sapiens Pex5 structure (retrieved from the Protein Data Bank, structure accession: 1FCH) and in the 3D model of its Saccharomyces cerevisiae homologue (automatically generated by the SwissPdb server,28–30 Figure 3). These sequence regions, three α-helices, are shown in an alignment of known Pex5 sequences (Figure 4) together with the six ligand contact positions that differ between human and S. cerevisiae. The substitution of Asn462 (in human, other mammalia, plants and trypanosomatidae) by a tyrosine (in fungi) is the most pronounced difference between taxa. Together with the isoleucine substitution for Thr377, this change is suggested to explain the taxon-specific tolerance of terminal methionine (mammalia) and phenylalanine (yeast). It should be noted that in the available crystal structure, the H. sapiens Pex5–PTS1 complex appears in dimeric form, with one binding site oriented towards the second Pex5 molecule and the other one towards the solvent. For all structural considerations throughout this work, we used only the Pex5–PTS1 complex with a binding cavity that points towards the solvent. Variations of the hydrophobicity level at position −1 can be compensated by positions −3 and −2 (Table 2 rows M13–M14, F6). Evidently, this region as a whole should be sufficiently polar. This hypothesis is supported by the fact that no sequence in the total learning set contains three consecutive non-hydrophilic residues at positions −3, −2 and −1.
Figure 2. Sequence logo (ref. 47) for the 12 C-terminal residues of PTS1-containing proteins. The sequence sets used for this purpose were composed of the 12-residue-long LH sequences and a largest subset of SW proteins (see Methodological Details) from (a) metazoa (136 sequences), (b) fungi (61 sequences) and (c) all taxa taken together (212 sequences). Generated by the weblogo server at http://weblogo.berkeley.edu/ using default settings.
The region interacting with the surface of Pex5: positions −3 to −6
At position −3, positively charged residues are favoured for all taxa (Table 1 rows M13, F1 and F12). Metazoan sequences prefer arginine at position −3 (Table 1 row M14), whereas lysine (Table 1 rows F2, F13) has a privileged occurrence in fungi. Fungi also favour flexible residues (Table 1 row F3). A preference for hydrophilic residues was observed in metazoan sequences (Table 1 row M1). Fungal and metazoan proteins require a separate discussion for this region. We start with the latter. In accordance with Lametschwandtner et al.,18 we find a preference for leucine at position −4 among metazoan LH set sequences (Table 1 row M17). In contrast, SW set proteins seem to favour alanine rather than leucine (Table 1 row M4). The true requirements for position −4 in the PTS1 signal are apparently more general: for both SW and LH learning sets, a preference for hydrophobic and especially aliphatic residues at position −4 was detected (Table 1 rows M2, M3, M15 and M16). A preference for hydrophilic residues at position −5 was observed for the SW learning set (Table 1 row M5). At position −6, single-position correlations point towards steric requirements, a need for flexibility and a preference for lysine (Table 1 rows M18, M6 and M7). Fisher analysis on
Figure 3. Differences in the binding cavities of human and yeast (modelled 3D structure) Pex5 lead to shifted amino acid preferences at position 0 (generated with SwissPdbViewer, ref. 29). H. sapiens residues = green, S. cerevisiae residues = violet.
combinations of two and three residues including positions −6 and/or −5 yields numerous significant F-values for both SW (Table 2 rows M1–M9) and LH sets (Table 2 rows M15–M19) using hydrophilicity- and flexibility-related properties. Positions −5 and −6 are more hydrophobic in LH set proteins than in SW set sequences. If we average the hydrophobicity levels at both positions over the respective two sets, the value of the LH set tends to be higher than that of the SW set, regardless of the hydrophobicity scale used (scales SWER830101 (optimal matching hydrophobicity)31 and EISD860101 (solvation free energy)32). The difference is > 0.5 SWISSPROT standard deviations for the respective scales. The entirety of physical properties characteristic for positions −3 to −6 can be interpreted in terms of interactions with the Pex5 surface. As a whole, the region ranging from position −6 to the C terminus should retain a certain amount of flexibility and hydrophilicity. Position −3, preferably occupied by positively charged or, at least, hydrophilic residues, points towards polar regions at the exit of the PTS1-binding cavity (Figure 5). Its spatial vicinity to residue −1 explains the numerous significant F-values obtained for hydrophilicity
Figure 4. Multiple alignment of Pex5 sequences generated and coloured by clustalx. The entries were retrieved from the SWALL database at http://srs.ebi.ac.uk/. The residues on top of the alignment refer to human Pex5 and are those that are within 6 Å distance of the γ-C-atom of the PTS1 C-terminal leucine and differ between H. sapiens and S. cerevisiae.
Figure 5. Spatial arrangement of PTS1 residues −1/−3 and human Pex5 amino acids asparagine 378 and glutamate 379 (generated with SwissPdbViewer, ref. 29).
related properties (Table 2 rows M13–M14), also valid for fungal proteins (Table 2 row F6). Human Pex5 residues Ile527, Ile530, Leu572 and Met576 form a hydrophobic surface in spatial vicinity of PTS1 position −4 (Figure 6), explaining the preference for hydrophobic residues in both learning sets. The additional requirement for flexibility is best met by aliphatic, non-β-branched amino acid types. Hence, leucine at position −4 may be the ideal residue for peptides with weak C-terminal tripeptides, providing enough stabilising hydrophobic side-chain surface. If the strength of hydrophobic interaction is not critical, as in the SW set sequences (see Methodological Details), alanine is preferred as the more flexible residue. PTS1 positions −6 and −5 appear to have a dual role. They might assist position −4 in complex formation with hydrophobic residues if a weakly binding C-terminal tripeptide is present. With more canonical C-terminal tripeptides, their role in providing backbone flexibility seems increased. In the case of fungal proteins, single-position correlations (Table 1 rows F4–F6 and F14–F16) indicate a preference for hydrophilic and flexible residues in the entire sequence stretch ranging
Figure 6. Residues of human Pex5 that are located at least partly within 6 Å of the tyrosine γ-C-atom at PTS1 position −4 (generated with SwissPdbViewer, ref. 29).
Figure 7. Average (a, b) and negative (c, d) charge between positions −11 and 0. Fungal and metazoan learning sets compared to SWISSPROT 40.
from positions −3 to −6. Several significant F-values regarding these properties (Table 2 rows F1–F5 and F7–F12) reveal strong inter-positional correlations in conjunction with downstream positions −2 and −1. In contrast to metazoa, no indications of hydrophobic requirements in this region could be detected. Hence, residues within this region seem to be oriented towards polar regions in Pex5 or towards the solvent. As in metazoa, the sequence stretch should retain a certain amount of flexibility.
The role of charge: positions −3 to −8
The preference for specifically charged residues does not seem to be restricted to residues −3 and −1. Negatively charged residues are disfavoured and positive charge is preferred in the sequence stretch ranging up to position −8 for metazoan sequences and for the fungal LH set (Figure 7 and Table 2 rows M10, F13 and F14). In the case of fungal sequences from SWALL, negative charges that do occur are compensated by positive charges, resulting in a neutral average charge.
Providing solvent accessibility and backbone flexibility: positions −7 to −11
Although PTS1 positions do not reveal clear amino acid type preferences here, the sequence is not completely random and shows similarities to the extracellular and cytosolic parts of transmembrane proteins (Table 1 rows M8–M12, F7–F11). Since requirements were found to be similar between fungi and metazoa, we analysed a unified set of proteins containing a largest subset of SW entries from all taxa. Significant single-position correlation coefficients (Table 1 rows A1–A5) confirm the detected preferences. Additionally, numerous significant F-values were obtained for different sets of positions within this sequence stretch. These point towards inter-positional hydrophobicity and flexibility compensations (Table 2 rows A1–A8). The definition of a boundary at positions −6/−7 is based on two considerations: (i) more C-terminal residues have potential for interaction with the surface of Pex5, (ii) a few residues N-terminal of the boundary should simply bridge the distance between the surface of Pex5
and the substrate protein globule and, typically, not interact with the Pex5 surface. Although hydrophilic and flexible residues seem to be favoured in region −7 to −11, these preferences are not strict. These five residues seem to be a minimal form of a linker region reminiscent of those encountered in previous characterisations of post-translational modification motifs.23,27 We suggest that this region makes the PTS1 signal accessible to the receptor by providing hydrophilicity and flexibility in conjunction with downstream positions. It cannot be excluded that, due to the size of the learning set and the limited sensitivity of property analysis, we underestimate the length of this linker region.
α-helical or β-sheet elements in the C-terminal region might favour structural fixation and lower accessibility. The measured inter-positional correlations between amino acid type occurrences and secondary structure properties indicate a reduced intrinsic preference for secondary structures. For the LH set, the depression of α-helical preference measured with the ISOY800101 scale26 was used for motif length determination. Similarly, F-values of 1.34 and 1.44 (95.2% and 98.1% significance) for positions −11/−8/−7 and −9/−6/−4, obtained using the property LEVM780105 (normalised frequency of β-sheet),33 point to limited acceptance of β-strand structure in LH set sequences. The same trend is observed for the SW set: inter-positional correlations in the range of F = 1.46 for the property PALJ810104 (normalised frequency of β-sheet, 95.4% significance)34 at positions −8/−6/−4 or F = 1.45 for BURA740101 (normalised frequency of α-helix, 95.1% significance)35 at positions −10/−7/−5 were obtained. We calculated the levels of secondary structure elements using the “PREDATOR” program36,37 on a largest subset of SW proteins (81 sequences) and all 12-residue-long LH sequences (131 entries). For the LH sequences, the program predicts α-helices or β-sheets for fewer than three sequences (mostly none) at any motif position. For the SW set, depression of secondary structures is observed for the nine C-terminal residues (coil is predicted for more than 50% of sequences at the respective motif positions). Only at positions −12/−13 is the stationary level for protein globules reached. Although we measured restrictions in amino acid type variability over the motif region −11…0, these sequences are typically too rich in amino acid types to be noticeable in sequence complexity calculations. With any of the three standard “SEG” (ref. 38) parametrisations (12–2.2–2.5, 25–3.0–3.3, 45–3.4–3.75), at most two entries are hit. The detailed SEG and PREDATOR data can be found at the associated website.
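To see why a 12-residue signal is rarely flagged as low-complexity, it helps to look at the kind of quantity that complexity filters threshold. The sketch below computes the Shannon entropy (in bits) of the residue composition of the C-terminal window and compares it with a trigger value. It is a simplified stand-in, not a reimplementation of SEG: SEG additionally slides the window along the sequence and extends and refines its hits. The function names are hypothetical, and the defaults (window 12, trigger 2.2) are assumed to correspond to the first two numbers of the first parametrisation quoted above.

```python
from collections import Counter
from math import log2


def shannon_entropy(window):
    """Shannon entropy (bits) of the residue composition of `window`."""
    n = len(window)
    counts = Counter(window)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def cterm_is_low_complexity(seq, window=12, trigger=2.2):
    """True if the last `window` residues have entropy at or below `trigger`.

    Assumption: a single fixed C-terminal window is inspected; no window
    sliding or segment extension is performed.
    """
    if len(seq) < window:
        raise ValueError("sequence shorter than window")
    return shannon_entropy(seq[-window:]) <= trigger
```

A window of twelve different residues reaches log2(12) bits, well above the trigger, whereas a window made of only two residue types in equal amounts has exactly 1 bit and would be flagged.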
Summary: PTS1 motif structure
To summarise the results of the motif re-evaluation, we found that the PTS1 signal comprises the 12 C-terminal residues of a targeted substrate protein. This sequence stretch can be structurally and functionally divided into three regions: (i) the C-terminal tripeptide, which lies in the receptor protein binding cavity upon docking with Pex5, (ii) the region directly upstream (~four residues), which interacts with the surface of Pex5, and (iii) a polar, solvent-accessible and unstructured region with linker function (the remaining ~five residues).
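The three-region layout summarised above can be written down as a small helper. This is only an illustrative sketch: the function name `split_pts1_motif` and the dictionary keys are hypothetical, and the region boundaries (−11 to −7, −6 to −3, −2 to 0) are fixed here for convenience although the text gives the lengths of the two upstream regions as approximate.

```python
def split_pts1_motif(seq):
    """Split the 12 C-terminal residues into the three proposed regions.

    Boundaries are an assumption for illustration: the text describes the
    two upstream regions as roughly four and roughly five residues long.
    """
    if len(seq) < 12:
        raise ValueError("need at least 12 residues")
    motif = seq[-12:]
    return {
        "linker": motif[:5],       # positions -11 .. -7
        "surface": motif[5:9],     # positions -6 .. -3
        "tripeptide": motif[9:],   # positions -2 .. 0
    }


if __name__ == "__main__":
    # Invented example sequence, not taken from the learning sets.
    print(split_pts1_motif("MAAAAQEDLGRPVHAKSKL"))
```

Keeping the split in one place makes it easy to compute region-wise statistics (for example, mean charge in the surface region) separately for each learning set.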
Methodological Details
Generation of learning sets of PTS1 targeted proteins
The sequence annotations in the SWALL database (May 2001) were searched for proteins that are targeted to peroxisomes. We looked for entries having the statement “peroxisome” in the keywords field or “peroxisomal” in the comments field. In the same manner, we added sequences annotated as “glyoxysomal” and “glycosomal”. In total, the raw database contained 416 entries. From this set, all peroxisomal membrane proteins as well as proteins with peroxisome-related functions that are located in other subcellular compartments were removed. Thus, 223 peroxisomal matrix proteins were left. In line with previous analyses of typical errors in database annotations,23,27,39 we investigated the reliability of the SWALL annotations regarding subcellular localisation and targeting of the learning set proteins. There was a discrepancy between the actual state of experimental verification and the annotations in SWALL. Therefore, we investigated the original literature for each database entry and searched for experimental evidence certifying its PTS1-dependent transport. We chose all sequences with a reliable localisation in the peroxisomal matrix and excluded proteins that solely contained a putative or verified PTS2 signal. We additionally excluded the carnitine-octanoyl-transferases OCTC_BOVIN (O19094), OCTC_HUMAN (Q9UKG9) and OCTC_RAT (P11466), which were judged unlikely to be PTS1 targeted due to the high frequency of hydrophobic residues yielding a potentially inaccessible C terminus. These entries are annotated as “potentially” peroxisomal in SWISSPROT. Moreover, we added five hydroxymethylglutaryl-CoA lyases (Q29448, P35914, P38060, P97519, P35915) and five isocitrate dehydrogenases (O75874, Q9Z2K9, Q9Z2K8, O88844, P41562) that were not detected during the database search. These enzyme groups are annotated as “mitochondrial” and “cytoplasmic”, respectively, but their peroxisomal localisation has meanwhile been successfully demonstrated.40–42 The resulting training set (called “SW set”) contained a total of 205 entries but with uneven distribution among taxa (Table 3).
In addition to the sequences from the SWALL database, we used a set of 150 oligopeptides (hexadecapeptides or shorter) that were obtained in a peptide library screen for interaction with C. elegans (S. Langer, L. Wabnegger and A. Hartig, unpublished, 50 sequences), S. cerevisiae (35 sequences) and H. sapiens (65 sequences) Pex5 in the yeast two-hybrid system18 (called “LH set”). Such a peptide is fused to the C terminus of an artificial protein consisting of 114 residues from the activation domain of Gal4 (amino acids 768–881)43 plus
Table 3. Sequences included in the learning set of entries retrieved from the SWALL database (SW learning set)

Taxon                             Metazoa  Fungi  Plant  Other  Total
PTS1 and localisation verified          8      6      1      1     16
Localisation verified                  36     19     10      6     71
Total                                  96     37     61     11    205
seven linker residues (only the PTS1 test peptides are considered sequences of the LH set). The inclusion of these sequences yields a heterogeneous database consisting of two sequence sets (LH and SW) with different characteristics. In the yeast two hybrid screen, emphasis was put on weakly interacting oligopeptides, since the authors wanted to identify unusual C-terminal tripeptides. Positions upstream of the three C-terminal positions have been supposed to be more critical if the last three residues differ considerably from the consensus.18 Therefore, we assume that requirements regarding upstream positions are strongly reflected by sequences in the LH set harbouring C-terminal tripeptides that do not fully match the consensus. Due to the construction plan of the LH set peptides, it is highly unlikely that the C termini are buried. Hence, C-terminal accessibility is supposed to be generally provided and might be insufficiently reflected in the C-terminal sequence with regard to flexibility and hydrophilicity requirements. On the other hand, the naturally occurring sequences of the SW set are expected to broadly represent the overall residual and physical requirements of the PTS1 signal. However, the canonical {SA}-{KR}-L tripeptides appear overrepresented, as numerous PTS1 targeted proteins have been annotated by homology or other theoretical considerations. In spite of these fundamental differences, we expect that including the LH set sequences into the database benefits the motif description in three ways: (i) the overall database size is nearly doubled, (ii) the information content regarding tolerance of non-canonical residues within the C-terminal tripeptide is increased, and (iii) physico-chemical requirements upstream of this tripeptide appear more pronounced. However, it is clear that statistical analyses of the LH and SW set sequences have to be performed and interpreted separately.
Extraction of the PTS1 recognition pattern: concept of analysis

The calculation of relative occurrences $p(a, i)$ of residue types $a$ at given motif positions $i$ is complicated by the heterogeneity of the learning set. We separately used the gapless multiple alignments of (i) the SW learning set, (ii) the LH learning set, and (iii) a unified learning set including all available sequences. The C-terminal region has to fit as a whole into the binding cavity of Pex5; thus, the concept of gaps, as in loops between secondary structural elements, is not applicable here. The issue of uneven sequence family representation was addressed by (i) computing a largest subset of sequentially unrelated sequences21,22 (only for the SW learning set), and (ii) applying the PSIC position- and sequence-specific weighting scheme20 (see below for both approaches).

For each alignment position $i$ studied, we calculated the correlation coefficients between the vector of 20 amino acid type frequencies $(p(A, i), p(C, i), \ldots, p(Y, i))$ and the vectors of amino acid indices quantifying physical properties of amino acid types. We used a database of almost 700 amino acid indices.23,24 If not stated otherwise, the effective residue composition $(p(A, i), p(C, i), \ldots, p(Y, i))$ was calculated using the PSIC approach.

Single-position requirements cannot describe possible correlation effects among sequence positions of the substrate. The sequence context is sometimes more critical than a single disfavoured residue. Therefore, we analysed the pattern of interdependence between two and three alignment positions. The sums of squared variances of physico-chemical parameters of individual alignment columns were calculated and compared with the squared variances calculated for joint sets of columns. If the ratio $F$ is larger than a critical value, the Fisher criterion suggests the existence of correlations among sequence positions (see below).

To avoid a compositional bias from included Gal4 stretches in LH set sequence C termini, we utilised only those sequences that had full-length (16 residues) C-terminal test peptides when evaluating the motif length. In the same manner, we excluded all sequences from the LH learning set that were shorter than the obtained motif length (12 residues) for the evaluation of physical property patterns. This reduced LH set contained 131 sequences (31 S. cerevisiae, 54 H. sapiens and 46 C. elegans entries). Moreover, calculations involving residues upstream of the C-terminal tripeptide (“upstream positions”) were performed using an LH learning set containing only sequences with non-consensus C-terminal tripeptides (at least one position does not match the {SA}-{KR}-L consensus). In this case, the remaining LH set contained 20 S. cerevisiae, 29 human and 21 C. elegans sequences.
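The single-position analysis described above can be illustrated with a minimal sketch. It makes several assumptions that are not from the paper: frequencies are plain unweighted counts rather than PSIC-weighted ones, the Kyte-Doolittle hydropathy scale stands in for one of the roughly 700 indices, the peptides are toy examples, and all function names are hypothetical.

```python
from collections import Counter
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy, used only as an example property index
# (assumption: the index database used in the paper is not reproduced here).
KYTE_DOOLITTLE = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8, "G": -0.4, "H": -3.2,
    "I": 4.5, "K": -3.9, "L": 3.8, "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5,
    "R": -4.5, "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def is_consensus_tripeptide(seq):
    """True if the last three residues match the {SA}-{KR}-L consensus."""
    tri = seq[-3:]
    return len(tri) == 3 and tri[0] in "SA" and tri[1] in "KR" and tri[2] == "L"


def position_frequencies(seqs, pos):
    """Unweighted p(a, i) for a position counted from the C terminus.

    pos is negative: -1 is the last residue, -2 the one before it, and so on.
    Sequences too short to cover the position are skipped.
    """
    residues = [s[pos] for s in seqs if len(s) >= -pos]
    if not residues:
        raise ValueError("no sequence covers this position")
    counts = Counter(residues)
    return [counts.get(a, 0) / len(residues) for a in AMINO_ACIDS]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant vector")
    return sxy / sqrt(sxx * syy)


# Toy C-terminal peptides (hypothetical, not taken from the LH or SW sets).
peptides = ["GASKL", "TTAKL", "QPSRL", "EEARL", "LLSKI"]
non_consensus = [p for p in peptides if not is_consensus_tripeptide(p)]

freqs = position_frequencies(peptides, -1)
index = [KYTE_DOOLITTLE[a] for a in AMINO_ACIDS]
r = pearson(freqs, index)
```

Aligning from the C terminus with negative indices mirrors the gapless alignment used in the text; a weighted frequency vector could be substituted for `freqs` without changing the correlation step.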
Division into taxonomical subgroups

Unequal substrate requirements between different taxa should be taken into consideration when extracting a recognition pattern for a given motif. On the other hand, splitting up the learning set into taxonomical subgroups should not reduce too much the number of sufficiently diverse sequences (called “largest subset”, see below) in each taxonomic set. Fundamental differences regarding substrate specificity have already been detected between the H. sapiens and S. cerevisiae Pex5 receptor molecules.18 Hence, proteins from these species have to be analysed separately. We decided to split the learning sets at the kingdom level, resulting in metazoan, fungal, plant and “remaining” (euglenozoa, alveolata, mycetozoa) taxonomical subsets. The metazoan subset is the largest one with 36 entries from the SW set in addition to 65 human and 50 C. elegans sequences of the LH set. The second largest group comprises the fungal sequences, with a largest subset of 30 sequences from the SW set in addition to 35 S. cerevisiae peptides from the LH set. Although 61 plant sequences are included in the SW learning set, the respective largest subset contains only nine entries. Only a few enzyme groups from this kingdom are included, each represented by many homologous sequences (mostly catalases). In view of the taxonomic composition of our sequence sets, we
considered that separate recognition pattern analyses should be made for fungal and metazoan proteins, but not for the remaining sequences, due to the small amount of available information.

Balancing for uneven representation of protein families

Two different mechanisms have been used for balancing the representation of different classes of sequences in the alignment. First, the largest subset of sequences with maximal pairwise sequence identity below 30% (for the 30 C-terminal residues) has been determined following published algorithms.21,22 In the alternative approach, PSIC (position-specific independent counts), all sequences contribute to the $p(a, i)$ computation, but with sequence- and position-specific weighting. An amino acid type $a$ at a given alignment position $i$ in a subset of the alignment provides less new information than a single observation of $a$, the more similar the sequences in the subset are at other alignment positions.20 The methodology has been applied in the same manner as previously outlined.44,45

Both approaches yield essentially similar results. However, PSIC uses sequence information more efficiently and was, therefore, more sensitive. We used it where possible (single position correlations, Table 1). It should be noted that the correlations calculated from the largest subset follow the trends computed with the PSIC approach, but are generally smaller. Yet, the PSIC approach is not directly applicable for correlations over multiple positions due to the position-specific weighting. In these cases, we used the respective largest subsets to calculate F-values (Table 2).
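As an illustration of the first balancing approach, the sketch below selects a subset of sequences whose pairwise identity over the 30 C-terminal residues stays below 30%. It is a simple greedy heuristic, not the published largest-subset algorithms cited in the text, so the result is not guaranteed to be the largest possible subset. The function names, the normalisation by the shorter window, and the example sequences are assumptions made for illustration.

```python
def c_terminal_identity(a, b, window=30):
    """Fraction of identical residues over the last `window` positions.

    The two sequences are aligned gaplessly from the C terminus. If one
    sequence is shorter than the window, only the shared positions count
    (assumption: the paper does not specify this case).
    """
    ta, tb = a[-window:], b[-window:]
    n = min(len(ta), len(tb))
    if n == 0:
        return 0.0
    matches = sum(1 for k in range(1, n + 1) if ta[-k] == tb[-k])
    return matches / n


def greedy_unrelated_subset(seqs, max_identity=0.30, window=30):
    """Keep a sequence only if it is below the cutoff against all kept ones."""
    kept = []
    for s in seqs:
        if all(c_terminal_identity(s, k, window) < max_identity for k in kept):
            kept.append(s)
    return kept


# Toy sequences (hypothetical): the first two differ at a single position.
seqs = ["MKTAYIAKQRSKL", "MKTAYIAKQRAKL", "GGDPEWNVHCSRL"]
subset = greedy_unrelated_subset(seqs)
```

The outcome of a greedy pass depends on input order; the clique-based methods referenced in the text avoid that dependence at a higher computational cost.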
Significance of correlation coefficients

Statistical significance of a correlation coefficient $r$ is computed using:

$$ t_\alpha \le \frac{r\sqrt{n-f}}{\sqrt{1-r^2}} \qquad (1) $$

for $n < 100$.46 In this equation, $n$ is the number of data points ($n = 20$ amino acid types) and $f$ represents the number of conditions (two for the linear regression and one for the sum of all amino acid-type frequencies being unity). The threshold $t_\alpha$ is the argument of the Student's distribution for a one-sided criterion with the confidence level $\alpha$ ($t_\alpha = 3.222$ for $\alpha = 0.0025$ and $t_\alpha = 4$ for $\alpha = 0.001$). This results in $r > 0.62$ and $r > 0.70$, respectively.

Fisher's test

The relation $F$ between the sum of the squared variances $s_i^2$ for independent sequence positions $i_1, i_2, \ldots$ and the squared variance $s(i_1, i_2, \ldots)^2$ for putatively correlated positions:

$$ F = \frac{\sum_i s_i^2}{s(i_1, i_2, \ldots)^2} \qquad (2) $$

measures correlation among positions.46 The interdependence is statistically significant if $F > F_{\mathrm{critical}}$, where the decision criterion $F_{\mathrm{critical}}$ depends on the number of sequences over which the average was calculated.
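The two criteria can be checked numerically with a short sketch. Solving equation (1) for $r$ gives $r = t_\alpha / \sqrt{t_\alpha^2 + n - f}$, which reproduces the quoted thresholds for $n = 20$ and $f = 3$. For equation (2), the sketch assumes that the joint variance $s(i_1, i_2, \ldots)^2$ is the variance of the per-sequence sum of a property over the selected positions; the text does not spell this out, so this is one plausible reading only. Function names and the example values are hypothetical.

```python
from math import sqrt
from statistics import variance


def critical_r(t_alpha, n=20, f=3):
    """Smallest r satisfying equation (1) for a given Student threshold."""
    return t_alpha / sqrt(t_alpha ** 2 + (n - f))


def fisher_ratio(columns):
    """Ratio F of summed single-column variances to the joint variance.

    columns: one list per alignment position, each holding the property
    value of every sequence at that position (same sequence order).
    Assumption: the joint variance is taken over the per-sequence sums.
    """
    summed = sum(variance(col) for col in columns)
    joint = variance([sum(values) for values in zip(*columns)])
    if joint == 0:
        raise ValueError("joint variance is zero; F is undefined")
    return summed / joint


r_0025 = critical_r(3.222)  # about 0.62
r_001 = critical_r(4.0)     # about 0.70

# Toy property values at two positions that compensate each other:
# a high value at one position tends to pair with a low value at the other.
col_1 = [1.0, 2.0, 3.0, 4.0]
col_2 = [4.0, 3.0, 2.0, 2.0]
f_value = fisher_ratio([col_1, col_2])
```

Under this reading, independent positions give F close to 1, while compensating positions shrink the joint variance and push F above 1; whether a given F is significant still depends on the critical value for the number of sequences used.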
Acknowledgements

The authors are grateful for continuous support from Boehringer Ingelheim. This project has been partly funded by the Fonds zur Förderung der wissenschaftlichen Forschung Österreichs (FWF grant P15037), by the Austrian National Bank (OeNB, Österreichische Nationalbank), and by the GENAU bioinformatics project (BMBWK Austria).
References

1. Holroyd, C. & Erdman, R. (2001). Protein translocation machineries of peroxisomes. FEBS Letters, 501, 6–10.
2. Van Den Bosch, H., Schutgens, R. B. H., Wanders, R. J. A. & Tager, J. M. (1992). Biochemistry of peroxisomes. Annu. Rev. Biochem. 61, 157–197.
3. Gould, S. G., Keller, G. A. & Subramani, S. (1987). Identification of a peroxisomal targeting signal at the carboxy terminus of firefly luciferase. J. Cell Biol. 105, 2923–2931.
4. Osumi, T., Tsukamoto, T., Hata, S., Yokota, S., Miura, S., Fujiki, Y. et al. (1991). Amino-terminal presequence of the precursor of peroxisomal 3-ketoacyl-CoA thiolase is a cleavable signal peptide for peroxisomal targeting. Biochem. Biophys. Res. Commun. 181, 947–954.
5. Swinkels, B. W., Gould, S. J., Bodnar, A. G., Rachubinski, R. A. & Subramani, S. (1991). A novel, cleavable peroxisomal targeting signal at the amino-terminus of the rat 3-ketoacyl-CoA thiolase. EMBO J. 10, 3255–3262.
6. Kragler, F., Langeder, A., Raupachova, J., Binder, M. & Hartig, A. (1993). Two independent targeting signals in catalase A of Saccharomyces cerevisiae. J. Cell Biol. 120, 665–673.
7. Elgersma, Y., Van Roermund, C. W., Wanders, R. J. & Tabak, H. F. (1995). Peroxisomal and mitochondrial carnitine acetyltransferases of Saccharomyces cerevisiae are encoded by a single gene. EMBO J. 14, 3472–3479.
8. Klein, A. T., Van Den Berg, M., Bottger, G., Tabak, H. F. & Distel, B. (2002). Saccharomyces cerevisiae acyl-CoA oxidase follows a novel, non-PTS1, import pathway into peroxisomes that is dependent on Pex5p. J. Biol. Chem. 277, 25011–25019.
9. Brocard, C., Kragler, F., Simon, M. M., Schuster, T. & Hartig, A. (1994). The tetratricopeptide repeat-domain of the PAS10 protein of Saccharomyces cerevisiae is essential for binding the peroxisomal targeting signal-SKL. Biochem. Biophys. Res. Commun. 204, 1016–1022.
10. Fransen, M., Brees, C., Baumgart, E., Vanhooren, C. T., Myriam, B., Mannaerts, G. P. & Van Veldhoven, P. P. (1995). Identification and characterization of the putative human peroxisomal C-terminal targeting signal import receptor. J. Biol. Chem. 270, 7731–7736.
11. Rehling, P., Marzioch, M., Niesen, F., Wittke, E., Veenhuis, M. & Kunau, W. H. (1996). The import receptor for the peroxisomal targeting signal 2 (PTS2) in Saccharomyces cerevisiae is encoded by the PAS7 gene. EMBO J. 15, 2901–2913.
12. Purdue, P., Zhang, J. W., Skoneczny, M. & Lazarow, P. B. (1997). Rhizomelic chondrodysplasia punctata is caused by deficiency of human PEX7, a homologue of the yeast PTS2 receptor. Nature Genet. 15, 381–384.
13. Braverman, N., Steel, G., Obie, C., Moser, A., Moser, H., Gould, S. J. & Valle, D. (1997). Human PEX7 encodes the peroxisomal PTS2 receptor and is responsible for rhizomelic chondrodysplasia punctata. Nature Genet. 15, 369–376.
14. Tsukamoto, T., Hata, S., Yokota, S., Miura, S., Fujiki, Y., Hijikata, M. et al. (1994). Characterization of the signal peptide at the amino terminus of the rat peroxisomal 3-ketoacyl-CoA thiolase precursor. J. Biol. Chem. 269, 6001–6010.
15. Glover, J. R., Andrews, D. W., Subramani, S. & Rachubinski, R. A. (1994). Mutagenesis of the amino targeting signal of Saccharomyces cerevisiae 3-ketoacyl-CoA thiolase reveals conserved amino acids required for import into peroxisomes in vivo. J. Biol. Chem. 269, 7558–7563.
16. Gould, S. J., Keller, G. A., Hosken, N., Wilkinson, J. & Subramani, S. (1989). A conserved tripeptide sorts proteins to peroxisomes. J. Cell Biol. 108, 1657–1664.
17. Gatto, G. J., Geisbrecht, B. V., Gould, S. J. & Berg, J. (2000). Peroxisomal targeting signal-1 recognition by the TPR domains of human PEX5. Nature Struct. Biol. 7, 1091–1095.
18. Lametschwandtner, G., Brocard, C., Fransen, M., Van Veldhoven, P., Berger, J. & Hartig, A. (1998). The difference in recognition of terminal tripeptides as peroxisomal targeting signal 1 between yeast and human is due to different affinities of their receptor Pex5p to the cognate signal and to residues adjacent to it. J. Biol. Chem. 273, 33635–33643.
19. Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M., Estreicher, A., Gasteiger, E. et al. (2003). The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucl. Acids Res. 31, 365–370.
20. Sunyaev, S. R., Eisenhaber, F., Rodchenkov, I. V., Eisenhaber, B., Tumanyan, V. G. & Kuznetsov, E. N. (1999). PSIC: profile extraction from sequence alignments with position-specific counts of independent observations. Protein Eng. 12, 387–394.
21. Hobohm, U., Scharf, M., Schneider, R. & Sander, C. (1992). Selection of representative protein data sets. Protein Sci. 1, 409–417.
22. Heringa, J., Sommerfeldt, H., Higgins, D. & Argos, P. (1992). OBSTRUCT: a program to obtain largest cliques from a protein sequence set according to structural resolution and sequence similarity. Comput. Appl. Biosci. 8, 599–600.
23. Eisenhaber, B., Bork, P. & Eisenhaber, F. (1998). Sequence properties of GPI-anchored proteins near the omega-site: constraints for the polypeptide binding site of the putative transamidase. Protein Eng. 11, 1155–1161.
24. Tomii, K. & Kanehisa, M. (1996). Analysis of amino acid indices and mutation matrices for sequence comparison and structure prediction of proteins. Protein Eng. 9, 27–36.
25. Nakashima, H. & Nishikawa, K. (1992). The amino acid composition is different between the cytoplasmic and extracellular sides in membrane proteins. FEBS Letters, 303, 141–146.
26. Isogai, Y., Nemethy, G., Rackovsky, S., Leach, S. J. & Scheraga, H. A. (1980). Characterization of multiple bends in proteins. Biopolymers, 19, 1183–1210.
27. Maurer-Stroh, S., Eisenhaber, B. & Eisenhaber, F. (2002). N-terminal N-myristoylation of proteins: refinement of the sequence motif and its taxon-specific differences. J. Mol. Biol. 317, 523–540.
28. Peitsch, M. C. (1995). Protein modeling by e-mail. Biotechnology, 13, 658–660.
29. Guex, N. & Peitsch, M. C. (1997). SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modelling. Electrophoresis, 18, 2714–2723.
30. Guex, N., Diemand, A. & Peitsch, M. C. (1999). Protein modelling for all. Trends Biochem. Sci. 24, 364–367.
31. Sweet, R. M. & Eisenberg, D. (1983). Correlation of sequence hydrophobicities measures similarity in three-dimensional protein structure. J. Mol. Biol. 171, 479–488.
32. Eisenberg, D. & McLachlan, A. (1986). Solvation energy in protein folding and binding. Nature, 319, 199–203.
33. Levitt, M. (1978). Conformational preferences of amino acids in globular proteins. Biochemistry, 17, 4277–4285.
34. Palau, J., Argos, P. & Puigdomenech, P. (1982). Protein secondary structure. Int. J. Pept. Protein Res. 19, 394–401.
35. Burgess, A. W., Ponnuswamy, P. K. & Scheraga, H. A. (1974). Analysis of conformations of amino acid residues and prediction of backbone topography in proteins. Isr. J. Chem. 12, 239–286.
36. Frishman, D. & Argos, P. (1996). Incorporation of non-local interactions in protein secondary structure prediction from the amino acid sequence. Protein Eng. 9, 133–142.
37. Frishman, D. & Argos, P. (1997). Seventy-five percent accuracy in protein secondary structure prediction. Proteins, 27, 329–335.
38. Wootton, J. C. & Federhen, S. (1993). Statistics of local complexity in amino acid sequences and sequence databases. Comput. Chem. 17, 149–163.
39. Nielsen, H., Brunak, S. & Von Heijne, G. (1999). Machine learning approaches for the prediction of signal peptides and other protein sorting signals. Protein Eng. 12, 3–9.
40. Ashmarina, L. I., Rusnak, N., Miziorko, H. M. & Mitchell, G. A. (1994). 3-Hydroxy-3-methylglutaryl-CoA lyase is present in mouse and human liver peroxisomes. J. Biol. Chem. 269, 31929–31932.
41. Geisbrecht, B. V. & Gould, S. J. (1999). The human PICD gene encodes a cytoplasmic and peroxisomal NADP+-dependent isocitrate dehydrogenase. J. Biol. Chem. 274, 30527–30533.
42. Yoshihara, T., Hamamoto, T., Munakata, R., Tajiri, R., Ohsumi, M. & Yokota, S. (2001). Localization of cytosolic NADP-dependent isocitrate dehydrogenase in the peroxisomes of rat liver cells: biochemical and immunocytochemical studies. J. Histochem. Cytochem. 49, 1123–1131.
43. Yang, M., Wu, Z. & Fields, S. (1995). Protein-peptide interactions analyzed with the yeast two-hybrid system. Nucl. Acids Res. 23, 1152–1156.
44. Eisenhaber, B., Bork, P. & Eisenhaber, F. (1999). Prediction of potential GPI-modification sites in proprotein sequences. J. Mol. Biol. 292, 741–758.
45. Maurer-Stroh, S., Eisenhaber, B. & Eisenhaber, F. (2002). N-terminal N-myristoylation of proteins: prediction of substrate proteins from amino acid sequence. J. Mol. Biol. 317, 541–557.
46. Kendall, M. & Stuart, A. (1977). The Advanced Theory of Statistics, Griffin, London.
47. Schneider, T. D. & Stephens, R. M. (1990). Sequence logos: a new way to display consensus sequences. Nucl. Acids Res. 18, 6097–6100.
48. Rackovsky, S. & Scheraga, H. A. (1977). Hydrophobicity, hydrophilicity, and the radial and orientational distributions of residues in native proteins. Proc. Natl Acad. Sci. USA, 74, 5248–5251.
49. Nakashima, H., Nishikawa, K. & Ooi, T. (1990). Distinct character in hydrophobicity of amino acid composition in mitochondrial proteins. Proteins: Struct. Funct. Genet. 8, 173–178.
50. Zvelebil, M., Barton, G., Taylor, W. & Sternberg, M. (1987). Prediction of protein secondary structure and active sites using the alignment of homologous sequences. J. Mol. Biol. 195, 957–961.
51. Karplus, P. & Schulz, G. (1985). Prediction of chain flexibility in proteins. Naturwiss. 72, 212–213.
52. Levitt, M. (1976). A simplified representation of protein conformations for rapid simulation of protein folding. J. Mol. Biol. 104, 59–107.
53. Argos, P., Rao, J. K. & Hargrave, P. A. (1982). Structural prediction of membrane-bound proteins. Eur. J. Biochem. 128, 565–575.
54. Zimmerman, J. M., Eliezer, N. & Simha, R. (1968). The characterization of amino acid sequences in proteins by statistical methods. J. Theoret. Biol. 21, 170–201.
55. Fauchere, J. L., Charton, M., Kier, L. B., Verloop, A. & Pliska, V. (1988). Amino acid side chain parameters for correlation studies in biology and pharmacology. Int. J. Pept. Protein Res. 32, 269–278.
56. Eisenberg, D. (1984). Three-dimensional structure of membrane and surface proteins. Annu. Rev. Biochem. 53, 595–623.
57. Grantham, R. (1974). Amino acid difference formula to help explain protein evolution. Science, 185, 862–864.
58. Vihinen, M., Torkkila, E. & Riikonen, P. (1994). Accuracy of protein flexibility predictions. Proteins: Struct. Funct. Genet. 19, 141–149.
59. Cid, H., Bunster, M., Canales, M. & Gazitua, F. (1992). Hydrophobicity and structural classes in proteins. Protein Eng. 5, 373–375.
60. Ponnuswamy, P. K., Prabhakaran, M. & Manavalan, P. (1980). Hydrophobic packing and spatial arrangement of amino acid residues in globular proteins. Biochim. Biophys. Acta, 623, 301–316.
61. Biou, V., Gibrat, J. F., Levin, J. M., Robson, B. & Garnier, J. (1988). Secondary structure prediction: combination of three different methods. Protein Eng. 2, 185–191.
62. Tanaka, S. & Scheraga, H. A. (1977). Statistical mechanical treatment of protein conformation 5. A multistate model for specific-sequence copolymers of amino acids. Macromolecules, 10, 9–20.
63. Bhaskaran, R. & Ponnuswamy, P. K. (1988). Positional flexibilities of amino acid residues in globular proteins. Int. J. Pept. Protein Res. 32, 241–255.
64. Cohn, E. J. & Edsall, J. T. (1943). Editors of Proteins, Amino Acids and Peptides as Ions and Dipolar Ions, Reinhold, New York.
Edited by J. Thornton (Received 31 October 2002; received in revised form 3 March 2003; accepted 6 March 2003)