Analysis of Void Volumes in Proteins and Application to Stability of the p53 Tumour Suppressor Protein




doi:10.1016/j.jmb.2004.10.015

J. Mol. Biol. (2004) 344, 1199–1209

Alison L. Cuff and Andrew C. R. Martin*

School of Animal and Microbial Sciences, University of Reading, Whiteknights, P.O. Box 228, Reading RG6 6AJ, UK

We have developed a new method for the analysis of voids in proteins (defined as empty cavities not accessible to solvent). This method combines analysis of individual discrete voids with analysis of packing quality. While these are different aspects of the same effect, they have traditionally been analysed using different approaches. The method has been applied to the calculation of total void volume and maximum void size in a non-redundant set of protein domains and has been used to examine correlations between thermal stability and void size. The tumour suppressor protein p53 has then been compared with the non-redundant data set to determine whether its low thermal stability results from poor packing. We found that p53 has average packing, but the detrimental effects of some previously unexplained mutations to p53 observed in cancer can be explained by the creation of unusually large voids. © 2004 Elsevier Ltd. All rights reserved.

*Corresponding author

Keywords: protein structure; void volumes; p53; stability; packing

Present addresses: A. L. Cuff, School of Biological Sciences, Queen Mary, University of London, Mile End Road, London E1 4NS, UK; A. C. R. Martin, Department of Biochemistry and Molecular Biology, University College London, Gower Street, London WC1E 6BT, UK.

Abbreviations used: MPP, minimum perturbation protocol; IARC, International Agency for Research on Cancer.

E-mail address of the corresponding author: [email protected]

0022-2836/$ - see front matter © 2004 Elsevier Ltd. All rights reserved.

Introduction

Protein cavities are holes in the interior of a protein that are not accessible to bulk solvent. They may be large enough to accommodate other atoms or molecules such as water, but may be empty. Empty cavities are termed “voids”. Native, folded proteins are generally well-packed, particularly in their core regions, which is critical for their correct conformation and stability.1,2 However, internal cavities and packing defects are found in proteins of all sizes and, in proteins of ≥100 residues, there is almost always at least one cavity large enough to accommodate a water molecule.1,2 Note that the volume of a water probe of radius 1.4 Å is 11.49 Å³, but any cavity able to accommodate it will be considerably larger as it will not be a perfect hollow sphere.

Cavities can be created by a mutation of a larger amino acid to a smaller one. There have been a number of studies on the structural and thermodynamic response of proteins to cavity-creating mutations.3–7 Structural effects tend to be relatively modest and are generally restricted to the immediate vicinity of the mutation site, with side-chains surrounding the site shifting slightly (usually by a few tenths of an Ångström) towards the space vacated by the substituted side-chain.3–5 Eriksson et al.3 reported that the overall structure tends to “relax” slightly in response to a cavity-creating mutation. However, the degree of movement depends on the flexibility of both the amino acid being substituted and the surrounding atoms.3,4

Cavity-creating mutations can decrease protein thermostability;1 for a number of leucine-to-alanine replacements in T4 lysozyme, Eriksson et al.3 noted that there was an approximately linear correlation between the increase in cavity size and decrease in protein stability (i.e. the difference between the free energy of unfolding, ΔΔG, of the native and mutant protein). Similar results were obtained by Xu et al.4 for a number of large non-polar amino acid to alanine substitutions. The loss of protein stability consists of a constant energy term, which can be thought of as the difference in hydrophobicity of the two amino acid residues involved in the substitution, and a cavity-dependent energy term. This second term derives from the van der Waals interactions between the side-chain to be replaced and the surrounding atoms in the native structure. If the mutation occurs in a relatively rigid region of the protein, the wild-type structure is maintained, resulting in the creation of a cavity with the loss of stabilizing van der Waals interactions as well as an entropic penalty. If the side-chains surrounding the mutation site are flexible enough partially to fill the cavity, additional van der Waals interactions may be generated, partially counteracting some of the destabilizing effects of the mutation. If the side-chains are sufficiently flexible to result in the complete collapse of a cavity, any energy loss will be due entirely to a decrease in hydrophobicity.3–5

We have investigated a number of freely available programs for calculating cavity volumes. Kleywegt & Jones,8 for example, developed a grid-based program, VOIDOO, which maps a protein onto a 3D grid and then assigns grid points as protein, bulk solvent, or cavity depending on their location in relation to the protein structure. In comparison, VOLBL9 is an analytical method that utilizes “alpha-shapes” to calculate internal cavities for both the solvent accessible and molecular surface models. Both VOIDOO and VOLBL use probes of the same size to detect the solvent accessible surface of a protein and its internal cavities. However, this leads to problems if one wishes to examine cavities smaller than those that can be occupied by a water molecule. If one uses too small a probe, the cavities will “leak” to the bulk solvent; larger (water-sized) probes will not detect smaller cavities. This problem would be solved by using two probes: one to delimit the solvent accessible regions, the second to identify voids. In principle, it should be possible to modify either of these programs to allow different void and solvent probes, but source code is freely available for neither. In addition, the concept of a probe in VOIDOO is achieved by increasing the radii of the protein atoms rather than using a non-zero probe size, making it impossible to use two probes without major changes to the software.
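As a quick check on the probe volumes quoted in this paper, the volume of a spherical probe follows directly from its radius. The snippet below is only an illustrative calculation; the function name `sphere_volume` is ours and is not part of any of the programs discussed.

```python
import math


def sphere_volume(radius):
    """Volume (in Å^3) of a spherical probe whose radius is given in Å."""
    return 4.0 / 3.0 * math.pi * radius ** 3


print(f"{sphere_volume(1.4):.2f}")  # water-sized solvent probe: 11.49
print(f"{sphere_volume(0.5):.2f}")  # 0.5 Å void probe used later: 0.52
```

A real cavity that admits such a probe must be larger than these figures, since the probe never fills the cavity exactly.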
If one is interested only in those cavities large enough to accommodate other atoms or molecules, using a probe size of radius 1.2–1.4 Å will avoid any problems associated with a very small probe size. However, we are interested in cavities of almost any size; any alteration in the close packing of a protein may have an effect on stability and is particularly important when considering the effects of mutations on marginally stable proteins such as the tumour suppressor protein p53.10

In the past, the analysis of discrete voids, large enough to accommodate a probe of a specified size (as performed by VOIDOO and VOLBL), and of packing quality, where voids of any size are considered, have been treated as completely separate problems. The difference in approach is illustrated by several programs available that assess packing quality. For example, a number of groups have used Voronoi polyhedra to calculate the volumes occupied by atoms.11–15 Richards defined packing density as “the ratio of the volume enclosed by the van der Waals envelope of a given molecule or atom to the actual volume of space that it occupies”.15 For closely packed atoms, packing density was found to be 0.74 (no units). However, this method works well only for buried residues and the “occluded surface algorithm” was later introduced16 to overcome this problem. This method was used in an extended analysis by Fleming & Richards.17 A number of other methods have been developed including the rapid algorithm QPack.18 This uses a simplified model of a protein chain in which each amino acid is represented as one to three spheres. These are allowed to grow until they touch one another and the relative sphere radii give an estimate of packing quality. Other methods are listed by Fleming & Richards.17

In reality, identifying discrete voids and analysing packing quality are simply extremes of the same phenomenon: non-optimal packing of atoms in a protein. Since we are interested in both general packing quality and discrete voids sufficiently large to destabilize a protein, we have created our own grid-based void calculation program, similar in principle to VOIDOO, using two probes: one for identifying the solvent accessible grid points, or exterior surface, of the protein (“solvent” probe) and one for detecting voids (“void” probe). Uniquely, this method allows for very small or zero-sized void probes without “leakage” of the void into the bulk solvent. We are thus able to assess both packing quality and individual voids, previously treated as separate problems, using a single method. We implemented this in the program AVP (“Another Void Program”) and used it to analyse void volumes in a large non-redundant set of protein structures. We then went on to apply this analysis to the p53 core domain, to assess (1) whether the marginal stability of p53 is a result of poor packing in the core domain and (2) whether any known mutations are likely to destabilize p53 through introduction of an unacceptably large void.

Results

AVP was initially run on our data set (see Methods) using a void probe with a radius of 0.0 Å and solvent probe with a radius of 1.4 Å. The histograms in Figure 1 show the distributions of total void volume and largest void volume for each structure in the data set. The distributions of void volumes in the two histograms are quite similar. This indicates that the total void volume of each structure consisted mainly of the volume of the largest void present. Table 1 gives some selected examples and shows that there were a number of particularly large single voids, some over 15,000 Å³. With the exception of proteins like the chaperones, it is unlikely that single discrete voids of this size actually exist, as they would have a huge destabilizing effect. Visual examination using RasMol19 of one example, domain 1 from chain A of PDB file 1js420 (1js4A1) overlaid with void points calculated with the 0.0 Å void probe (Figure 2) showed that channels existed between closely packed atoms (including through the axes of α-helices), allowing the zero-sized probe to merge what one visually would classify as distinct voids.

Figure 1. Histograms of (a) total void volumes and (b) largest void volumes for structures in the data set calculated with AVP using a void probe size with radius 0.0 Å and solvent probe size with radius 1.4 Å. Most frequent total void volumes were in the range 750–3800 Å³ while largest void volumes were in the range 750–3500 Å³. The two histograms are very similar, indicating that the majority of the total void volume is contributed by a single void.

Optimization of probe size

Using a larger void probe size will mean that some very small void regions are not included, but will give a more intuitive separation of distinct voids. We therefore used chain B of the p53 structure (PDB code: 1tsr21) to optimize the probe size. Probes of 0.0–0.7 Å radius were used in 0.1 Å steps; 1000 random orientations of the protein were generated in each case and the total and largest void volumes were calculated for each orientation (Figure 3). The mean total void volume remained fairly constant at approximately 3040 Å³ (σₙ₋₁ = 40) with probe sizes up to 0.3 Å, dropping

Table 1. Total and largest void volumes and total number of voids for a selection of structures in the data set, as calculated by AVP using a 1.4 Å radius solvent probe and void probes of 0.0 Å and 0.5 Å radii

            Void probe radius 0.0 Å                      Void probe radius 0.5 Å
Structure   Total void    Largest void   Number      Total void    Largest void   Number
            volume (Å³)   volume (Å³)    of voids    volume (Å³)   volume (Å³)    of voids
1cuk03          202.5         158.1         25           169.4          42.1         16
1d2nA2         1099.0         989.4         75           891.8         104.5         69
2scpA0         2741.0        2564.2        118          2287.1         369.1        144
1qmgA2         8440.0        7957.0        248          6346.3         176.8        501
1ayx00       14,978.4      14,606.0        317        11,674.7         520.8        780
1thg00       16,906.7      16,468.0        365        13,927.5        2487.2        843


Figure 2. The “leaking” effect of using too small a void probe size in AVP. Images are of the protein structure 1js4A1. Spheres represent void points and clusters of void points of the same colour represent the same void. (a) Voids calculated using a void probe of radius 0.0 Å. The vast majority of void points are dark blue, showing a huge void leaking through much of the protein interior. (b) Voids calculated with a void probe of radius 0.5 Å. The space once occupied by the one huge void is now seen to be split into a number of different, large voids.

approximately linearly with larger probe sizes. The largest void volume dropped rapidly, and approximately linearly, as the void probe size increased from 0.0 Å to 0.4 Å radius, with σₙ₋₁ ranging from 10.8 to 287 and peaking at a probe radius of 0.3 Å. From 0.5 Å probe radius, the total void volume decreased slowly and approximately linearly. Figure 3 also shows that the difference between total and largest void volume is maximal at around 0.4–0.5 Å probe radius.
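The orientation sampling used here (many random orientations per probe size, summarized by the mean and sample standard deviation) could be scripted along the following lines. This is a minimal sketch and not AVP's own code: `random_rotation`, `rotate` and `orientation_stats` are hypothetical names, rotations are drawn uniformly from random unit quaternions (the text does not say how orientations were generated), and `measure` stands in for whatever call returns a total or largest void volume for a set of coordinates.

```python
import math
import random
import statistics


def random_rotation(rng):
    """Uniformly random 3x3 rotation matrix built from a random unit quaternion."""
    w, x, y, z = (rng.gauss(0.0, 1.0) for _ in range(4))
    norm = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / norm, x / norm, y / norm, z / norm
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]


def rotate(coords, rot):
    """Apply a rotation matrix to a list of (x, y, z) coordinates."""
    return [tuple(sum(rot[i][k] * p[k] for k in range(3)) for i in range(3))
            for p in coords]


def orientation_stats(coords, measure, n_orientations=1000, seed=0):
    """Mean and sample (n-1) standard deviation of `measure` over random orientations."""
    rng = random.Random(seed)
    values = [measure(rotate(coords, random_rotation(rng)))
              for _ in range(n_orientations)]
    return statistics.mean(values), statistics.stdev(values)


# Toy check with a rotation-invariant measure (an interatomic distance):
# the spread over orientations is zero to within rounding error.
triangle = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
mean, sd = orientation_stats(triangle, lambda c: math.dist(c[0], c[1]),
                             n_orientations=100)
print(f"{mean:.3f} {sd:.3f}")
```

A grid-based void volume is not rotation-invariant, so passing such a calculation as `measure` would give a non-zero spread of the kind shown by the error bars in Figure 3.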


Grid-based methods such as VOIDOO and AVP are sensitive to orientation of the protein on the grid. We have analysed the levels of errors obtained in evaluating total and maximal void volumes. In addition to the p53 structure, two representative examples of mainly α (PDB codes: 1cem22 and 1lrv23), mainly β (1hxn24 and 1thv25) and mixed αβ (1thm26 and 1nar27) proteins were selected at random from CATH. For each of these seven examples, 1000 random orientations were generated and total and maximal void volumes were calculated using AVP (data not shown). For the total void volume, in all cases, the values followed an approximately normal distribution and mean and standard deviation values were calculated. The standard deviation ranged from 0.8% to 1.8% of the mean. Thus total void volume is relatively insensitive to orientation on the grid. In contrast, for the largest void volume, the form of the curves was more diverse, generally showing a form closer to a Poisson distribution. While calculation of a standard deviation and mean is not therefore strictly relevant to these curves, standard deviations of 18–30% of the mean were calculated for comparison with the total void volume. The high-value tails of the distributions were of low frequency, suggesting that the effects on the large-scale evaluation of void size distributions will be minimal. The p53 structure showed the lowest variation of largest void volume with orientation. For general use, we would therefore recommend running AVP on multiple random orientations of the protein (preferably several hundred or more) and selecting the most frequently occurring values. For the p53 structure, the standard deviations, illustrated by error bars, are shown in Figure 3. These show that sensitivity to orientation for p53 falls off for probe sizes above 0.3 Å and is very low at the 0.5 Å probe size used in this analysis.

A probe radius of 0.5 Å was selected for all future analysis on the basis of: (1) wanting as small a probe as possible in order to minimize the effect on total void volume; (2) having a low standard deviation of void volumes to minimize the effect of grid orientation on the resulting calculated void volume; (3) minimizing the effect of probe radius on void sizes; (4) maximizing the difference between total and largest void volumes.

At a probe radius of 0.5 Å, the histograms shown in Figure 4 demonstrate that total void volumes are very much larger than the volume of the largest void for each of the structures in our data set. The largest single void was now only approximately 2640 Å³, 99% of all largest single void volumes were smaller than 725 Å³ and 80% were smaller than 275 Å³ (see Table 2). Figure 2(b) confirms that the “leaking voids” problem is indeed resolved at a probe radius of 0.5 Å.

Figure 3. Effects of probe radius on mean void sizes (largest and total) calculated for 1000 orientations of the p53 structure (1tsr). Standard deviations are indicated with error bars.

Cavity-creating mutations in p53

In a recent study by Martin et al.,28 an automated protocol was developed to classify the effects of mutations on the p53 core domain according to their likely effects on the local structure of the protein. Methods included in this protocol identify mutations that affect hydrogen-bonding, mutations from glycine, to proline, or to conserved residues, mutations that result in clashes and mutations that involve DNA-binding or zinc-binding. Using this protocol, a total of 490 out of 882 distinct single-site mutations (55.6%) in the p53 core domain, representing 7642 out of 9824 observed mutations (77.8%), present in Release 4 of the IARC TP53 mutation database29 were explained on a purely structural basis. We have now updated this analysis and, in Release 8, 518 out of 941 distinct mutations (55.1%), or 10,303 out of 13,111 observed mutations (78.6%), can be explained. A significant number of mutations remained for which an explanation had yet to be found and one additional unexplored structural effect was that of cavity-creating mutations. AVP was therefore used to assess the structural effects of “larger-to-smaller” amino acid substitutions. Void volumes for native p53 and a number of cavity-creating mutants were calculated using a void probe of radius 0.5 Å and solvent probe of radius 1.4 Å. As shown in Table 2, 80% of the largest voids in our data set, calculated

Figure 4. Histograms showing distribution of (a) total void volume and (b) largest void volume for structures in the data set calculated with AVP using a 0.5 Å radius void probe and 1.4 Å radius solvent probe. Most frequent total void volumes were within the range 300–3450 Å³ and in the range 50–225 Å³ for largest void volume. Note, in contrast to Figure 1, the difference in scale of the two histograms.


Table 2. Largest individual void sizes observed in fractions of our data set, i.e. in 99% of proteins studied, all voids are smaller than 725 Å³

Cutoff value (%)   Largest void volume (Å³)
99                 725.0
98                 625.0
90                 375.0
80                 275.0

Values were calculated using a 1.4 Å radius solvent probe and a 0.5 Å radius void probe.

using the 0.5 Å radius void probe, were less than 275 Å³ in volume, suggesting that voids of this size are generally well-tolerated by most protein structures. We decided therefore only to investigate those larger-to-smaller amino acid mutations that create voids of 275 Å³ or greater, i.e. ones that are more likely to be destabilizing. Results are shown in Table 3.

The total void volume of native p53, as calculated by AVP, is 2728.1 Å³, and its largest void volume 157.5 Å³. Both of these are within the normal range when compared with the volumes from the structures in our data set (see Figure 4). As voids of >275 Å³ are seen in <20% of proteins and the largest void volume of each of the mutants in Table 3 is at least 1.8× the size of the largest void in native p53, these are likely to have a detrimental effect on stability of the native p53 conformation. Table 3 also shows that all but one of the mutant p53 structures have fewer voids than the native protein. This is owing to the mutations causing two or more adjacent voids present in the native structure to merge into one larger void (see Figure 5).

Stability of p53

To test the hypotheses that (1) the instability of p53 is a result of poor packing of the core domain and (2) further disruption of this packing is the major cause of dysfunction of the protein resulting from cavity-creating mutations, we compared the total void volume of p53 (Tm = 42 °C) with the much more stable BPTI (Tm = 87 °C).30 Figure 6 shows the distribution of total void volumes normalized by the protein size (expressed as the number of residues) using the set of structures selected from the CATH database. As the histogram shows, this follows a normal distribution with a mean of 13.98 Å³/residue and σₙ₋₁ of 4.47 Å³/residue. Native p53 has a total void volume of 14.06 Å³/residue (Z-score = 0.018) and therefore shows very normal packing. The low thermal stability of p53 cannot therefore be attributed to poor packing. BPTI has a total void volume of 13.3 Å³/residue and while the Z-score (0.152) is not

Table 3. Void volumes in p53 mutants compared with the native p53 core domain

Structure    Total void     Largest void   Number
             volume (Å³)    volume (Å³)    of voids
Native         2728.1          157.5          177
Met133Thr      2780.6          303.0          174
Met133Val      2775.0          297.0          176
Val143Ala      2768.4          323.1          170
Val143Gly      2788.8          345.5          169
Ala159Gly      2753.0          314.7          177
Val173Gly      2791.0          306.3          174
Leu194Ile      2729.0          363.6          173
Leu194Pro      2763.7          293.8          173
Val216Gly      2794.7          334.2          174
Tyr234Asn      2787.0          308.3          175
Tyr234Asp      2792.1          313.6          175
Tyr234Cys      2799.7          342.4          172
Tyr234His      2757.3          283.8          174
Tyr234Ser      2814.1          365.2          171
Tyr236Cys      2796.1          372.0          172
Tyr236Ser      2812.0          401.5          170
Ile255Ser      2781.9          293.0          174
Phe270Cys      2797.0          288.5          172
Phe270Ser      2810.8          410.5          170
Phe270Val      2784.2          291.0          170

Chain B of PDB entry 1tsr was used for the native structure. Mutants were generated using the “minimum perturbation protocol”. Void volumes and number of voids were calculated using AVP with a solvent probe of radius 1.4 Å and a void probe of radius 0.5 Å, and largest void volumes greater than 275 Å³ are shown.

Figure 5. Separate voids are merged into one after a larger-to-smaller amino acid substitution. In (a), residue Phe109 is shown in the native p53 core domain surrounded by a number of different voids (shown as clusters of different coloured spheres). In (b), Phe109 has been replaced with a serine. Most of the spheres surrounding the residues are now of one colour, indicating that the voids present in the native protein have now merged.


Figure 6. Histogram of normalized total void volumes using a selection of structures from the CATH database. The locations of BPTI, native p53 and the average value for our set of cavity-creating mutations are indicated. It can clearly be seen that both the native p53 and mutants are within the normal distribution range and are therefore no more loosely packed than the majority of protein structures in our data set.

particularly large, it is clear that BPTI is better packed than p53. The cavity-creating mutations clearly increase the total cavity volume as well as introducing large single cavities. While the overall packing quality is not responsible for the low stability of p53 and the total cavity volume of mutants is still well within the normal range (Z-scores 0.02–0.11), a small reduction in stability as a result of packing may be enough to disrupt folding of the p53 core domain.

General stability analysis

Having shown that the stability of native p53 is not directly influenced by poor packing, or by having unusually large voids, we decided to look at whether these are generally factors which influence protein stability. Using ProTherm30† we selected all proteins with known melting temperatures and PDB codes. ProTherm contains results collected from a large number of experimental papers and melting temperatures for the same protein can vary. We therefore averaged the values for each PDB file and, having removed a few large (>200 residues) or non-globular proteins, obtained a dataset containing 354 PDB files with known Tm. AVP was used to calculate the total and maximum void size for each of these structures. Figure 7 shows plots of Tm against (a) total void volume (normalized by the protein size expressed as the number of residues) and (b) largest void volume. The Pearson’s r correlation coefficients are −0.07 and −0.25, respectively. Thus, while there is no correlation with total void volume, there is a weak correlation with largest void volume (i.e. as the largest void size increases, the Tm and hence the stability of the protein decreases). While packing does not seem to be a major influence on protein stability, as expected it does seem to make a contribution, supporting the notion that introduction of large voids in p53 will destabilize the protein.

† http://gibk26.bse.kyutech.ac.jp/jouhou/Protherm/protherm.html
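The two statistics used in these sections, the Z-score of a normalized void volume and Pearson's r, can be reproduced with a few lines of code. The sketch below is illustrative only: `z_score` and `pearson_r` are hypothetical helper names, and the numbers passed in are the mean, standard deviation and per-residue void volumes quoted in the text. Note that the signed Z-score for BPTI is negative (less void volume per residue than average); the text quotes its magnitude.

```python
import math


def z_score(value, mean, sd):
    """Number of standard deviations by which `value` lies above the mean."""
    return (value - mean) / sd


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Normalized total void volumes (Å^3 per residue) against the data set
# mean of 13.98 and standard deviation of 4.47.
print(round(z_score(14.06, 13.98, 4.47), 3))  # native p53: 0.018
print(round(z_score(13.3, 13.98, 4.47), 3))   # BPTI: -0.152
```

Calling `pearson_r(tm_values, largest_void_volumes)` on the averaged ProTherm melting temperatures and the corresponding AVP results would give the coefficient plotted in Figure 7(b); those per-protein values are not listed in the paper, so they are not reproduced here.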

Discussion

AVP unites the analysis of individual discrete voids and overall packing quality using a single algorithm. To achieve this, separate probes are used to delineate solvent accessible regions and to detect voids, thus resolving problems associated with using a single probe size. While overall packing quality can be assessed using a zero-sized void probe, analysis of discrete voids requires a larger probe in order to prevent leakage from one void to another. Thus one needs to select a void probe size that is (1) sufficiently large that voids cannot leak and is (2) sufficiently small that it will detect voids that are too small to accommodate, for example, a water molecule, but nonetheless may be important. Here, we found that a void probe size of radius 0.5 Å seemed to be too large to pass through gaps between closely packed protein atoms, but was small enough to detect most important small voids. This value also minimizes the standard deviation of void sizes calculated with different orientations of the protein on the grid while giving a close to maximal difference between largest and total void volumes. We used this void probe size for both the analysis of total void volume (i.e. packing quality) and largest discrete void.

In principle, we are trying to estimate the absolute void volume (i.e. the solvent accessible volume of the protein minus the volume occupied by the atoms). This can be obtained by setting the void probe size to zero. However, even optimal packing of atoms will generate voids if calculated in this way and channels form between well-packed atoms including along the axis of an α-helix. By using a 0.5 Å probe, any voids unable to accommodate the probe (of volume 0.52 Å³) are not detected and some


Figure 7. Plots of melting temperature, extracted from the ProTherm database, against void volumes calculated with AVP using a 0.5 Å radius void probe and a 1.4 Å radius solvent probe. (a) Total void volume normalized by number of residues; (b) largest void volume.

packing defects may have been lost. However, detecting voids between optimally packed atoms is not useful and the 0.5 Å probe has a negligible effect on our packing analysis results.

We have investigated the possibility that p53 is unstable owing to its core domain being poorly packed. However, we have shown that this is not the case: p53 is no more loosely packed than many other proteins, most of which are much more stable. However, the decrease in packing quality observed in cavity-creating mutations may be sufficient to destabilize the marginally stable p53 and reduce its Tm below body temperature. It is worth noting that the p53 core domain is relatively large at approximately 200 residues in size and as such has a high enthalpy of denaturation. At 37 °C, its stability is only 3 kcal/mol; at 43 °C (the protein’s Tm), it is, by definition, 0 kcal/mol. Any alteration in the packing of the core domain may well decrease stability enough (i.e. by 3 kcal/mol) to lower its Tm to 37 °C and cause 50% denaturation at body temperature.10
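The link between a 3 kcal/mol loss of stability and 50% denaturation can be made explicit with a simple two-state (folded/unfolded) equilibrium. This is an illustrative calculation under that assumption, which the text does not spell out; the function name `fraction_unfolded` and the use of R = 1.987 cal/(mol K) are ours.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol K)


def fraction_unfolded(delta_g_unfold, temp_celsius):
    """Two-state estimate of the unfolded fraction for a given free energy
    of unfolding (kcal/mol) at the given temperature."""
    rt = R_KCAL * (temp_celsius + 273.15)
    return 1.0 / (1.0 + math.exp(delta_g_unfold / rt))


# Wild-type core domain at body temperature (stability of 3 kcal/mol):
# well under 1 % of molecules are unfolded.
print(f"{fraction_unfolded(3.0, 37.0):.3f}")

# After a mutation that removes those 3 kcal/mol, the free energy of
# unfolding is zero and half of the molecules are unfolded.
print(f"{fraction_unfolded(0.0, 37.0):.3f}")  # 0.500
```

In this picture the Tm is simply the temperature at which the free energy of unfolding passes through zero, which is why removing the whole 3 kcal/mol margin brings the Tm down to 37 °C.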

We have also looked at a set of 354 proteins from the Protein Data Bank for which Tm values are known and available in the ProTherm30 database. The largest void size shows a weak correlation with the melting temperature, confirming the notion that introduction of a large void into p53 is likely to destabilize the structure. The un-normalized total void volume does show a weak correlation with Tm (Pearson’s r ≈ 0.3), but this is largely the effect of protein size. Unsurprisingly, total void volume increases with the number of residues in the protein (Pearson’s r ≈ 0.9) and Tm shows a weak correlation with the number of residues (Pearson’s r ≈ 0.4). These results confirm that normalization by protein size is the correct strategy. Grid-based methods such as VOIDOO and AVP are sensitive to the orientation of the protein on the grid. However, with the off-grid refinements we have included (see Methods), we have shown that total void volume is relatively insensitive to orientation (σₙ₋₁ is 0.8–1.8% of the mean,


depending on probe size), giving us confidence in our conclusion that the low stability of p53 is not a result of poor overall packing. The largest void volume is more sensitive to orientation, but this sensitivity is small at the selected void probe size of 0.5 Å. Orientation effects will be averaged out in the distributions based on our dataset of 8925 proteins. Fortuitously, the p53 structure was less sensitive to orientation than the other structures tested. Since a single orientation was used for comparison of void sizes in all the p53 mutants, relative values can be safely compared.

In assessing the effects of mutations on p53, the “minimum perturbation protocol” (MPP) for side-chain replacement31 was used. Adjacent side-chain or backbone movements are not considered although, in reality, slight readjustments of some of these may reduce the destabilizing effects of these mutations. Coupled perturbation32 would allow adjacent side-chains to be considered, but some assessment of backbone flexibility would be required for full evaluation of possible rearrangements. In addition, we make the assumption that voids opened by side-chain replacements are not filled with water molecules. Studies of lysozyme mutations3,4 suggest that this is generally a valid assumption, but predictions of potential solvation sites33 may be of value in the future.

Of the 20 distinct cavity-creating mutations identified in p53 as likely to affect protein stability during this analysis, nine had already been explained by other factors in the automated method of Martin et al.28 This left 11 mutations for which no explanation had previously been given. If these 11 distinct mutations are now included with those from the automated analysis, it is possible to explain 529 distinct (56.2%) or 10,453 observed (79.7%) mutations.

In summary, we have developed a new method for analysis of void volumes in proteins. This has been applied to a non-redundant set of protein domains from CATH and used to assess void volumes in the p53 tumour suppressor protein. We have shown that packing quality is not, in general, correlated with protein stability and that poor packing does not seem to be a contributory factor to the low thermal stability of p53. However, mutations can open voids which may destabilize the native p53 fold sufficiently that it is no longer correctly folded at body temperature.

Methods Protein structure dataset The list of protein structures used in this study was obtained using the CATH classification of protein domains (v2.3)†. A Perl script was written to select only those non-redundant sequence family representatives ˚ . Only those (S-Reps) with a resolution of %2.00 A

structures possessing at least one water molecule were included to ensure that any voids detected were likely to be genuine voids and not cavities that should be solvated. A total of 8925 structures were obtained. Further Perl scripts were written to build a library of the selected protein domains in PDB format with interacting water molecules included in the files. AVP The AVP algorithm proceeds as follows: Grid construction A grid is constructed around the protein; this grid is large enough that at least one plane of water probes can be placed on all sides. Initially each point on the grid is assigned as being of type void. By default, the spacing of ˚. this grid is 1 A Protein assignment The grid is searched and any grid point that is within an atom sphere is changed to type protein. In order to speed this process, the list of atoms is first sorted along the x-axis. When grid points are checked, a maximum and minimum possible x-coordinate are calculated from the current x-grid position plus or minus the maximum ˚ ). A binary search of the radius of a protein atom (1.9 A sorted atom list is performed to find only those atoms within a yz slice of the protein neighbouring the required x-coordinate. Simple maximum distance checks are made on the y and z-coordinates before calculation of actual distances. Also, while performing this “walk” across the grid, a list of atoms within two atom radii of each grid point is associated with that point. This is used later during void volume refinement (see below). Solvent assignment The six surfaces of the grid are all assigned as type water. By moving along the positive and negative x, y and z-axes in turn, each point currently assigned as void is converted to solvent if a solvent probe can be placed at that point without clashing with protein and at least one of the 26 neighbouring points is already solvent (i.e. six face-connected, 12 edge-connected and eight cornerconnected neighbours). 
The same optimization of using an atom list sorted on x-coordinates with a binary search, described for protein assignment, is used during this stage. As well as setting each primary grid point to type solvent, any other points within the solvent radius (optionally multiplied by a “solvent expansion factor”) are also converted to type solvent. This process then iterates until no new points are converted from void to solvent. Points within a solvent radius must be converted to solvent as, if this is not done, a “shell” of void points is formed all around the protein surface. In the strategy adopted by Kleywegt & Jones,8 these points were instead assigned as protein as they grew the protein atoms by the probe radius. An alternative to the iterative strategy adopted here would be to use a flood-fill in 3D. However, the standard flood-fill algorithm is recursive and is computationally impractical for problems of this size.

† http://www.biochem.ucl.ac.uk/bsm/cath/

Void clustering
The preceding steps have flagged grid points classified as protein or solvent; remaining points, still assigned as type void, are now true voids. These are clustered to determine the distinct void regions in the protein. This is done by walking along the grid to find void points. Once a void point is found, a standard 3D flood-fill is started to cluster all connected adjacent points into the same void. The walk across the grid then continues until another void point is found that is not yet assigned to a void cluster.

Void volume refinement
At this stage an estimate of the void volumes may be made by assuming each void point represents a voxel of volume equal to the grid spacing cubed. At small or zero void probe sizes this will generally be a large overestimate, although some grid points may have been assigned as protein whereas the voxel they represent may be partially void. To improve accuracy, each void voxel and all neighbouring protein voxels are therefore split into 1000 sub-voxels and the total number of void sub-voxels is then counted. To speed up the search, only those atoms in the list associated with each original grid point are examined when making this assignment.

Like any grid-based method, AVP can suffer from errors resulting from the orientation of the protein on the grid. However, such methods are commonly used.8,34,35 We have taken steps to minimize these effects. As described above, voxels are initially assigned as protein by checking whether the centre of the voxel is within the van der Waals radius of a protein atom. We then check each voxel, initially assigned as protein, to see whether it contains any off-centre points at which a probe of minimum radius 0.1 Å could fit. To do this, we build a list of neighbouring atoms and look at each pair of atoms in turn.
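The void clustering and sub-voxel counting steps described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the flood-fill is written with an explicit queue rather than recursion, 26-neighbour connectivity is assumed for clustering (the text specifies 26 neighbours only for solvent assignment), the void probe radius is taken as zero, and all names are hypothetical rather than taken from AVP.

```python
from collections import deque
from itertools import product

VOID, PROTEIN, SOLVENT = 0, 1, 2

# All 26 neighbour offsets (face, edge and corner); an assumed connectivity.
NEIGHBOURS = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]


def cluster_voids(grid):
    """Group connected VOID points; grid maps (i, j, k) -> point type."""
    seen = set()
    clusters = []
    for start in sorted(grid):  # the "walk" across the grid
        if grid[start] != VOID or start in seen:
            continue
        # Queue-based flood-fill from this unassigned void point.
        cluster = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            i, j, k = queue.popleft()
            for di, dj, dk in NEIGHBOURS:
                nxt = (i + di, j + dj, k + dk)
                if grid.get(nxt) == VOID and nxt not in seen:
                    seen.add(nxt)
                    cluster.add(nxt)
                    queue.append(nxt)
        clusters.append(cluster)
    return clusters


def voxel_count_volume(cluster, spacing=1.0):
    """Crude estimate: each void point is a voxel of volume spacing cubed."""
    return len(cluster) * spacing ** 3


def refined_voxel_volume(centre, atoms, spacing=1.0, n=10):
    """Split one voxel into n**3 sub-voxels and sum those outside all atoms.

    atoms is a list of (x, y, z, radius); with n=10 this gives 1000 sub-voxels.
    """
    step = spacing / n
    offsets = [-spacing / 2 + (m + 0.5) * step for m in range(n)]
    void = 0
    for dx, dy, dz in product(offsets, repeat=3):
        x, y, z = centre[0] + dx, centre[1] + dy, centre[2] + dz
        if all((x - ax) ** 2 + (y - ay) ** 2 + (z - az) ** 2 > r * r
               for ax, ay, az, r in atoms):
            void += 1
    return void * spacing ** 3 / n ** 3
```

In practice the `atoms` argument would be the short list of nearby atoms stored for each grid point during protein assignment, so that each sub-voxel is tested against only a few atoms.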
If two atoms are separated by more than the sum of their radii plus the diameter of the probe, then we walk along the vector between the two atoms in 0.05 Å steps and check whether any point is at least a probe radius away from all other atoms: if so, the voxel is reassigned as void. A voxel can thus be assigned as void even if the grid spacing assigns it as protein. In the last step of the procedure described above, every voxel assigned as void is refined by splitting it into 1000 sub-voxels, each of which is individually assigned as protein or void.

Secondly, we examine all surface protein voxels and, in a similar way, reassign them as solvent if a solvent probe can be placed at any point along the vector between the centre of this voxel and the centre of an adjacent protein voxel. Again, this allows voxels whose centres are within the protein to be treated as solvent and, after voxel refinement (which reassigns sub-voxels as solvent or protein), produces a more accurate evaluation of void volume, reducing leakage of voids to the protein surface.

Our analysis of variation of void volume with probe size (see Figure 3) shows that when using a probe size of at least 0.5 Å, the effect of orientation on calculated void volumes is small. The AVP software may be downloaded.†

† http://www.bioinf.org.uk/software/avp/

Cavity-creating mutations in p53

A Perl script was written to retrieve mutations resulting in larger-to-smaller amino acid substitutions present in Release 8 of the International Agency for Research on Cancer (IARC) TP53 mutation database.29 As we were only interested in those mutations resulting in internal cavities, the relative solvent accessibility36 of the side-chain of the residue to be substituted in each mutation was calculated with NAccess (J. Hubbard & J. M. Thornton, unpublished results). Each residue with a solvent accessibility of less than 5% was assumed to be buried within the protein structure. Each mutation involving a buried residue was then modelled using the program MutModel,28 which implements the minimum perturbation protocol31 for side-chain replacements. Total and largest cavity volumes were then compared with the results obtained from the CATH S-Reps.
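The selection of candidate cavity-creating mutations described above (larger-to-smaller substitutions at residues with less than 5% relative side-chain accessibility) can be sketched as a simple filter. The text does not state which size scale defines "larger-to-smaller", so the sketch below assumes the number of side-chain heavy atoms as a stand-in; the function names and the input layout are likewise hypothetical, and the original work used Perl scripts and NAccess output rather than this code.

```python
# Side-chain heavy-atom counts per residue (one-letter codes). Used here only
# as an assumed proxy for residue size; the paper does not specify its scale.
SIDE_CHAIN_HEAVY_ATOMS = {
    "G": 0, "A": 1, "S": 2, "C": 2, "V": 3, "T": 3, "P": 3,
    "L": 4, "I": 4, "N": 4, "D": 4, "M": 4, "Q": 5, "E": 5,
    "K": 5, "H": 6, "F": 7, "R": 7, "Y": 8, "W": 10,
}

BURIED_THRESHOLD = 5.0  # % relative side-chain accessibility, as in the text


def is_larger_to_smaller(wild_type, mutant):
    """True if the mutant side chain has fewer heavy atoms than the wild type."""
    return SIDE_CHAIN_HEAVY_ATOMS[mutant] < SIDE_CHAIN_HEAVY_ATOMS[wild_type]


def cavity_candidates(mutations, accessibility, threshold=BURIED_THRESHOLD):
    """Keep larger-to-smaller substitutions at buried positions.

    mutations: iterable of (residue_number, wild_type, mutant) tuples.
    accessibility: dict of residue_number -> relative accessibility in percent.
    Residues missing from the dict are treated as exposed and skipped.
    """
    selected = []
    for position, wild_type, mutant in mutations:
        buried = accessibility.get(position, float("inf")) < threshold
        if buried and is_larger_to_smaller(wild_type, mutant):
            selected.append((position, wild_type, mutant))
    return selected


# Made-up example input (not values from the paper or the IARC database).
example_mutations = [(10, "L", "A"), (20, "A", "L"), (30, "F", "V")]
example_accessibility = {10: 2.0, 20: 1.0, 30: 40.0}
print(cavity_candidates(example_mutations, example_accessibility))
```

Only the mutations that pass this filter would then be passed on to side-chain modelling and void analysis.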

Acknowledgements

A.L.C. was funded by a UK Medical Research Council priority studentship in Bioinformatics.

References

1. Hubbard, S. J., Gross, K. H. & Argos, P. (1994). Intramolecular cavities in globular proteins. Protein Eng. 7, 613–626.
2. Williams, M. A., Goodfellow, J. M. & Thornton, J. M. (1994). Buried waters and internal cavities in monomeric proteins. Protein Sci. 3, 1224–1235.
3. Eriksson, A. E., Baase, W., Zhang, X.-J., Heinz, D., Blaber, M., Baldwin, E. P. & Matthews, B. W. (1992). Response of a protein structure to cavity-creating mutations and its relation to the hydrophobic effect. Science, 255, 178–183.
4. Xu, J., Baase, W. A., Baldwin, E. & Matthews, B. (1998). The response of T4 lysozyme to large-to-small substitutions within the core and its relation to the hydrophobic effect. Protein Sci. 7, 158–177.
5. Matthews, B. W. (1993). Structural and genetic analysis of protein stability. Annu. Rev. Biochem. 62, 139–160.
6. Buckle, A., Cramer, P. & Fersht, A. R. (1996). Structural and energetic responses to cavity-creating mutations in hydrophobic cores: observation of a buried water molecule and the hydrophilic nature of such hydrophobic cavities. Biochemistry, 35, 4298–4305.
7. Vlassi, M., Cesareni, G. & Kokkinidis, M. (1998). A correlation between the loss of hydrophobic core packing interactions and protein stability. J. Mol. Biol. 285, 817–827.
8. Kleywegt, G. & Jones, T. A. (1994). Detection, delineation, measurement and display of cavities in macromolecular structures. Acta Crystallog. sect. D, 50, 178–185.
9. Edelsbrunner, H., Facello, M., Fu, P. & Liang, J. (1995). Measuring proteins and voids in proteins. Proc. 28th Annu. Hawaii Int. Conf. Syst. Sci. 5, 256–264.
10. Bullock, A. N. & Fersht, A. R. (2001). Rescuing the function of mutant p53. Nature Rev. Cancer, 1, 68–76.
11. Lesk, A. M. & Chothia, C. (1980). Solvent accessibility, protein surfaces, and protein folding. Biophys. J. 32, 35–47.
12. Ptitsyn, O. B. & Volkenstein, M. V. (1986). Protein structure and neutral theory of evolution. J. Biomol. Struct. Dynam. 4, 137–156.
13. Gerstein, M., Sonnhammer, E. L. & Chothia, C. (1994). Volume changes in protein evolution. J. Mol. Biol. 236, 1067–1078.
14. Tsai, J., Taylor, R., Chothia, C. & Gerstein, M. (1999). The packing density in proteins: standard radii and volumes. J. Mol. Biol. 290, 253–266.
15. Richards, F. M. (1974). The interpretation of protein structures: total volume, group volume distributions and packing density. J. Mol. Biol. 82, 1–14.
16. Pattabiraman, N., Ward, K. B. & Fleming, P. J. (1995). Occluded molecular surface: analysis of protein packing. J. Mol. Recog. 8, 334–344.
17. Fleming, P. J. & Richards, F. M. (2000). Protein packing: dependence on protein size, secondary structure and amino acid composition. J. Mol. Biol. 299, 487–498.
18. Gregoret, L. M. & Cohen, F. E. (1990). Novel method for the rapid evaluation of packing in protein structures. J. Mol. Biol. 211, 959–974.
19. Sayle, R. A. & Milner-White, E. (1995). Rasmol: biomolecular graphics for all. TIBS, 20, 374–376.
20. Sakon, J., Irwin, D., Wilson, D. & Karplus, P. A. (1997). Structure and mechanism of endo/exocellulase E4 from Thermomonospora fusca. Nature Struct. Biol. 4, 810–818.
21. Cho, Y., Gorina, S., Jeffrey, P. D. & Pavletich, N. P. (1994). Crystal structure of a p53 tumor suppressor–DNA complex: understanding tumorigenic mutations. Science, 265, 346–355.
22. Alzari, P., Souchon, H. & Dominguez, R. (1996). The crystal structure of endoglucanase CelA, a family 8 glycosyl hydrolase from Clostridium thermocellum. Structure, 4, 265–275.
23. Peters, J. W., Stowell, M. H. & Rees, D. C. (1996). A leucine-rich repeat variant with a novel repetitive protein structural motif. Nature Struct. Biol. 3, 991–994.
24. Faber, H. R., Groom, C. R., Baker, H. M., Morgan, W. T., Smith, A. & Baker, E. N. (1995). 1.8 Å crystal structure of the C-terminal domain of rabbit serum haemopexin. Structure, 3, 551–559.
25. McPherson, A. & Weickmann, J. (1990). X-ray analysis of new crystal forms of the sweet protein thaumatin. J. Biomol. Struct. Dynam. 7, 1053–1060.
26. Teplyakov, A. V., Kuranova, I. P., Harutyunyan, E. H., Vainshtein, B. K., Frömmel, C., Hohne, W. E. & Wilson, K. S. (1990). Crystal structure of thermitase at 1.4 Å resolution. J. Mol. Biol. 214, 261–279.
27. Hennig, M., Schlesier, B., Dauter, Z., Pfeffer, S., Betzel, C., Hohne, W. E. & Wilson, K. S. (1992). A TIM barrel protein without enzymatic activity? Crystal-structure of narbonin at 1.8 Å resolution. FEBS Letters, 306, 80–84.
28. Martin, A. C. R., Facchiano, A. M., Cuff, A. L., Hernandez-Boussard, T., Olivier, M., Hainaut, P. & Thornton, J. M. (2002). Integrating mutation data and structural analysis of the TP53 tumour-suppressor protein. Hum. Mut. 19, 149–164.
29. Hainaut, P., Hernandez-Boussard, T., Robinson, A., Rodriguez-Tome, P., Flores, T., Hollstein, M. et al. (1998). IARC database of p53 gene mutations in human tumours and cell lines: updated compilation, revised formats and new visualisation tools. Nucl. Acids Res. 26, 205–213.
30. Bava, K. A., Gromiha, M. M., Uedaira, H., Kitajima, K. & Sarai, A. (2004). ProTherm, version 4.0: thermodynamic database for proteins and mutants. Nucl. Acids Res. 32, D120–D121.
31. Shih, H. H. L., Brady, J. & Karplus, M. (1985). Structure of proteins with single-site mutations: a minimum perturbation approach. Proc. Natl Acad. Sci. USA, 82, 1697–1700.
32. Snow, M. E. & Amzel, L. M. (1986). Calculating three-dimensional changes in protein structure due to amino acid substitutions in the variable region of immunoglobulins. Proteins: Struct. Funct. Genet. 1, 267–279.
33. Ehrlich, L., Reczko, M., Bohr, H. & Wade, R. C. (1998). Prediction of protein hydration sites from sequence by modular neural networks. Protein Eng. 11, 11–19.
34. Barford, D., Schwabe, J. W., Oikonomakos, N. G., Acharya, K. R., Hajdu, J., Papageorgiou, A. C. et al. (1988). Channels at the catalytic site of glycogen phosphorylase B: binding and kinetic studies with the beta-glycosidase inhibitor D-gluconohydroximo-1,5-lactone N-phenylurethane. Biochemistry, 27, 6733–6741.
35. Goodford, P. J. (1985). A computational procedure for determining energetically favorable binding sites on biologically important macromolecules. J. Med. Chem. 28, 849–857.
36. Lee, B. K. & Richards, F. M. (1971). The interpretation of protein structures: estimation of static accessibility. J. Mol. Biol. 55, 379–400.

Edited by J. Thornton (Received 11 May 2004; received in revised form 24 September 2004; accepted 12 October 2004)