Sequence-dependent DNA Structure: A Database of Octamer Structural Parameters




doi:10.1016/j.jmb.2003.08.006

J. Mol. Biol. (2003) 332, 1025–1035

Eleanor J. Gardiner1,2*, Christopher A. Hunter1, Martin J. Packer1, David S. Palmer1 and Peter Willett2

1 Department of Chemistry, University of Sheffield, Sheffield S3 7HF, UK
2 Department of Information Studies, University of Sheffield, Sheffield S1 4DP, UK

We have constructed the potential energy surfaces for all unique tetramers, hexamers and octamers in double helical DNA, as a function of the two principal degrees of freedom, slide and shift at the central step. From these potential energy maps, we have calculated a database of structural and flexibility properties for each of these sequences. These properties include: the values of each of the six step parameters (twist, roll, tilt, rise, slide and shift), for each step of the sequence; flexibility measures for both decrease and increase in each property value from the minimum energy conformation for the central step; and the deviation from the path of a hypothetical straight octamer. In an analysis of structural change as a function of sequence length, we observe that almost all DNA tends to B-DNA and becomes less flexible. A more detailed analysis of octamer properties has allowed us to determine the structural preferences of particular sequence elements. GGC and GCC sequences tend to confer bistability, low stability and a predisposition to A-form DNA, whereas AA steps strongly prefer B-DNA and inhibit A-structures. There is no correlation between flexibility and intrinsic curvature, but bent DNA is less stable than straight. The most difficult deformation is undertwisting. The TA step stands out as the most flexible sequence element with respect to decreasing twist and increasing roll. However, as with the structural properties, this behavior is highly context-dependent and some TA steps are very straight. © 2003 Elsevier Ltd. All rights reserved.

*Corresponding author

Keywords: sequence-dependent structure; DNA flexibility; DNA bentness; octamers; A-form

Introduction

Over 90% of the DNA of higher organisms does not code for proteins. However, the DNA sequence determines its 3D structural properties,1 and these properties are used by proteins in numerous ways such as in the packaging of DNA in the nucleus,2 in DNA replication, repair, recombination,3 and in the regulation of gene expression.4 Functional roles for most of this non-coding DNA have yet to be discovered. Sequence-dependent DNA structure is therefore of great importance and has been the subject of much investigation.

All-atom models of DNA structure are computationally expensive. Thus, whilst they are very useful in the modeling of specific sequences,5–7 they are unsuitable for exploring sequence space. Collective variable models of DNA, having the advantage of a simplified representation of the nucleic acid, can be used to investigate the structure of a large number of sequences. Models of varying degrees of complexity have been described. Early models considered mainly the interactions between the two base-pairs of a single dinucleotide step.8–10 Others sought to explain particular properties such as the bending of DNA.11,12 These models incorporated the effects of the composition of a base step but did not consider the surrounding sequence context. Trinucleotide models13,14 which enumerate all possibilities, can begin to include contextual information, but an enumerative approach is clearly limited.

We have developed a computational method for predicting the 3D structure of double helical DNA10,15–19 in terms of the six base step parameters (twist, roll, tilt, rise, slide and shift) of the Cambridge Accord.20 The method is based on an ab initio treatment of the base stacking interactions and an empirical model for the backbone and environment. Using a genetic algorithm to search conformational space, we located the global and local energy minima for 30 DNA oligomers (4–12 bases in length) whose X-ray structures are available and found good agreement in 24 cases,21 confirming the reliability of the method. Here, we examine the evolution of the structural properties of all possible DNA oligomers as the length of the sequence is increased from two base-pairs to eight base-pairs.

Abbreviations used: RMSD, root mean square deviation; R, purine; Y, pyrimidine.
E-mail address of the corresponding author: [email protected]
0022-2836/$ - see front matter © 2003 Elsevier Ltd. All rights reserved.

Approach

We have calculated and stored the conformational energy maps of all 136 unique tetramers, 2080 hexamers and 32,896 octamers. The collective variable model of DNA used in generating these maps has been described previously10,18–20,22 and we refer to these papers for full details. These methods produce a potential energy map based on the values of the slide and shift of the central step of each oligomer. The grid is calculated using 32 values of slide and 32 values of shift. For each pair of central step slide/shift values in the 1024-element grid, the map includes the values of each of twist, roll, tilt, rise, slide and shift for every step of the oligomer and the overall energy of the oligomer.

In order to analyze the properties of these DNA oligomers, we have generated a database of parameters that define the 3D structural properties of each oligomer in its minimum energy conformation. The parameters include the values of the six step parameters for each dinucleotide step in the oligomer (there are three steps for a tetramer, five for a hexamer and seven for an octamer).

We have also included flexibility measures for each of the step parameters. For example, to calculate the roll flexibility of an octamer, we plot the values of roll and energy for each element of the 1024-element slide-shift grid (Figure 1). The energy required to adopt a specific value of roll can then be determined by examining the lowest energy grid point on the scatter plot. The flexibility is defined as the force constant for a parabolic fit to these points as illustrated. Thus for each oligomer, we have divided the range of values assumed by parameter p (p = twist, roll, tilt, rise, slide, shift) at the central step into a number of discrete bins and taken the difference between the minimum energy value in each bin and the overall energy minimum for that oligomer. This corresponds to the energy required to distort the oligomer away from its lowest energy conformation to the required value of p.

Figure 1. Force curves for roll distortions of TATATATA. All energy/roll combinations from the 1024-element potential energy grid, normalised by subtracting the minimum energy value, are shown as dots. The minimum energy values for a given roll were fit to quadratic curves for left (continuous line) and right (broken line) roll to determine the flexibility force constants.

These data were then used to determine the flexibility force constants with respect to each parameter p assuming a Hooke's law model:

$$E = kx^{2}$$

where:

$$k = \frac{1}{n}\sum_{i=1}^{n}\frac{e_i}{x_i^{2}}$$

n = number of bins, e_i = minimum energy in bin i (kJ mol⁻¹), x_i = value of p in bin i (Å or degrees). However, the E versus x profiles generated from the conformational maps were not symmetric, so left and right flexibility were treated separately, and two different force constants were determined for each parameter. Thus left (right) flexibility is a measure of how easy it is for the parameter to decrease (increase), with low values of k corresponding to high flexibility. The force constants, k, were then used to determine the partition coefficients, Q, for each degree of freedom as follows:

$$Q = 0.5\int_{-\infty}^{\infty} e^{-(k/2.5)x^{2}}\,dx$$
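The force-constant and partition-coefficient definitions above can be illustrated with a short sketch. It is not the authors' code. It assumes that each x_i is measured as a displacement from the minimum energy value of the parameter, that bins are supplied as (x, e) pairs with the minimum energy already subtracted, and that the constant 2.5 in the integral is RT in kJ mol⁻¹ near room temperature. The function names are hypothetical. The Gaussian integral has the closed form Q = 0.5·sqrt(2.5π/k), which is used directly.

```python
import math

RT = 2.5  # kJ/mol; the constant in the integral, assumed here to be RT near room temperature


def force_constant(points):
    """Average e_i / x_i**2 over the bins, i.e. k in E = k * x**2.

    points: iterable of (x_i, e_i), with x_i the displacement from the energy
    minimum (an assumption) and e_i the lowest relative energy in that bin.
    """
    usable = [(x, e) for x, e in points if x != 0.0]
    if not usable:
        raise ValueError("need at least one bin away from the minimum")
    return sum(e / (x * x) for x, e in usable) / len(usable)


def left_right_force_constants(points):
    """Separate force constants for decreasing (x < 0) and increasing (x > 0) p."""
    points = list(points)
    left = [(x, e) for x, e in points if x < 0.0]
    right = [(x, e) for x, e in points if x > 0.0]
    return force_constant(left), force_constant(right)


def partition_coefficient(k):
    """Closed form of 0.5 * integral of exp(-(k / RT) * x**2) over all x."""
    return 0.5 * math.sqrt(math.pi * RT / k)


# Toy profile: stiffer on the left (k = 0.4) than on the right (k = 0.2)
toy = [(-2.0, 1.6), (-1.0, 0.4), (0.0, 0.0), (1.0, 0.2), (2.0, 0.8)]
k_left, k_right = left_right_force_constants(toy)
q_left, q_right = partition_coefficient(k_left), partition_coefficient(k_right)
```

In this toy profile the right-hand side has the smaller force constant and therefore the larger partition coefficient, matching the convention that low k means high flexibility.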

The partition coefficients provide an alternative measure of flexibility. The partition coefficient is the number of states that are populated at room temperature, so a large partition coefficient corresponds to a very flexible structure and vice versa. The factor of 0.5 deals with the use of different force constants for decreasing and increasing roll and twist ($Q^{-}_{\mathrm{Roll}}$, $Q^{+}_{\mathrm{Roll}}$, $Q^{-}_{\mathrm{Twist}}$, $Q^{+}_{\mathrm{Twist}}$). To a first approximation, the partition coefficients are independent and so may be summed to give an overall measure, $Q_T$, of the flexibility of each octamer.

We have also looked at the overall shape of the oligomers, which we define as the path of the helix axis through the base-pair triads. For a sequence of n base-pairs, this path can be considered as n − 1 vectors, joining the centers of the base-pairs. The length of the vectors depends on the structure, so we have defined a notional "straight" path for each oligomer, where the vectors are the same length as those of the original octamer but aligned with the z-axis. A least-squares fitting procedure is then used to calculate the RMSD of the actual path from the straight path. This RMSD parameter is not the same as the angular curvature, and so it is only useful for comparing the structures of sequences of the same length, because the magnitude of the RMSD tends to increase with sequence length. Here, we will only consider the octamer data.

Octamers which do not necessarily prefer a B-DNA conformation are unusual and therefore interesting. In this analysis, we have looked particularly at bistable octamers and A-form octamers. We consider that an octamer is bistable if its conformational potential energy map possesses two distinct energy minima with slide values differing by at least 1 Å. GGGGGGGG is an example of a bistable octamer. A-DNA is characterized by having low slide,22 and we define octamers with a slide value < −1 Å as A-form.
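The two definitions above (bistable and A-form) can be sketched as a classification over a slide/shift energy grid. The text does not say how "distinct energy minima" were located, so this sketch assumes a simple rule: a grid point is a minimum if it is strictly lower than all of its up to eight neighbours. The function names and the grid layout (`energy[i][j]` for slide index i and shift index j) are hypothetical.

```python
def local_minima(energy):
    """Indices (i, j) of grid points strictly lower than every neighbour."""
    rows, cols = len(energy), len(energy[0])
    minima = []
    for i in range(rows):
        for j in range(cols):
            neighbours = [
                energy[a][b]
                for a in range(max(0, i - 1), min(rows, i + 2))
                for b in range(max(0, j - 1), min(cols, j + 2))
                if (a, b) != (i, j)
            ]
            if all(energy[i][j] < e for e in neighbours):
                minima.append((i, j))
    return minima


def classify(energy, slide_values, slide_gap=1.0, a_form_cutoff=-1.0):
    """Apply the bistable (two minima >= 1 A apart in slide) and A-form (slide < -1 A) rules."""
    rows, cols = len(energy), len(energy[0])
    slides = [slide_values[i] for i, _ in local_minima(energy)]
    bistable = len(slides) >= 2 and (max(slides) - min(slides)) >= slide_gap
    # Global minimum over the whole grid decides the A-form label
    gi, _ = min(
        ((i, j) for i in range(rows) for j in range(cols)),
        key=lambda ij: energy[ij[0]][ij[1]],
    )
    return {
        "bistable": bistable,
        "a_form": slide_values[gi] < a_form_cutoff,
        "slide_at_minimum": slide_values[gi],
    }


# Toy 6 x 3 grid with wells at slide -1.5 (global) and slide 0.0 (local)
toy_slides = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5]
toy_energy = [[5, 4, 5], [3, 0, 3], [5, 4, 5], [6, 5, 6], [4, 1, 4], [6, 5, 6]]
result = classify(toy_energy, toy_slides)
```

On a real 32 × 32 map a stricter criterion (for example a minimum well depth) may be needed to avoid counting numerical ripples as separate minima; that choice is left open here.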

Results and Discussion

We first consider the overall behavior of the DNA structures as sequence length increases. Next we will focus on the conformations that tend to be assumed by particular step types. Finally we will look at those octamers that seem to be exceptions to the general rules. We will categorize families of tetramers, hexamers and octamers by the type of the central dinucleotide step. Thus, for example, NAAN is an AA tetramer, NNAANN is an AA hexamer and NNNAANNN is an AA octamer, where N is any base. Steps which adopt different conformations depending on the nature of the neighboring bases in the sequence are termed context-dependent.

Overall behavior

Slide

For each oligomer, the potential energy conformational map gives the value of slide of the central step in the minimum energy conformation. Slide is allowed to take one of 32 discrete values between −3 Å and +3 Å. Figure 2 shows a plot of slide at the central step versus the logarithm of the normalized frequency of occurrence for all tetramers, hexamers and octamers in our database (the tetramer, hexamer and octamer frequencies were normalized by multiplying by 32,896/136, 32,896/2080 and 1, respectively). The values of slide for each of the ten dimers are shown as diamonds. Although the individual dimers have quite different preferred conformations with slide values ranging from −2 Å to 3 Å, as the sequence length increases, the slide of the central step tends to zero, the value found in canonical B-form DNA. Some outliers persist, and these all have low values of slide, as found in A-form DNA. There are 566 sequences which have both A and B character, with a clear change in conformation halfway along the octamer. These sequences feature a run of steps with a strong A-DNA preference followed by a run of steps with a strong B-DNA preference, such as GGGGAAAA. The transition takes place at the GA step that has more ambivalent conformational preferences. No octamers have slide greater than 0.8 Å. The average octamer slide is −0.3 Å, which is very close to the average value of slide (−0.2 Å) in the database of DNA crystal structures23 and consistent with the observation that most mixed sequence DNA tends to adopt a B-type conformation.

Figure 2. Slide frequency plots. Slide at minimum energy for all tetramers (- -), hexamers (k) and octamers (—) with relative frequency plotted on a log scale. The slide positions for the ten dimers are shown as diamonds.
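The sequence counts used for the normalization (ten dimers, 136 tetramers, 2080 hexamers and 32,896 octamers) can be reproduced by treating a sequence and its reverse complement as the same duplex. That interpretation of "unique" is an assumption here, although it does give exactly the numbers quoted in the text. The helper names below are illustrative only.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(seq):
    """One fixed representative for a sequence and its reverse complement."""
    return min(seq, reverse_complement(seq))


def count_unique(length):
    """Number of distinct duplexes of the given length."""
    return len({canonical("".join(p)) for p in product("ACGT", repeat=length)})


counts = {n: count_unique(n) for n in (2, 4, 6, 8)}

# Factors that put tetramer and hexamer frequencies on the octamer scale
scale = {n: counts[8] / counts[n] for n in (4, 6, 8)}
```

For even length n the count equals (4^n + 4^(n/2))/2, because the 4^(n/2) self-complementary sequences are their own reverse complement and are counted once.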

Flexibility

The two most important parameters for defining the overall flexibility of an oligomer are twist and roll.24 A log normalized frequency plot of the total partition coefficient, $Q_T$, for all roll-twist degrees of freedom is shown in Figure 3 for all tetramers, hexamers and octamers. Flexibility clearly decreases on going from tetramer to hexamer to octamer, i.e. the number of structures with large partition coefficients decreases. Figure 4 shows the log normalized frequency plots of the individual partition coefficients, $Q^{-}_{\mathrm{Roll}}$, $Q^{+}_{\mathrm{Roll}}$, $Q^{-}_{\mathrm{Twist}}$, $Q^{+}_{\mathrm{Twist}}$. In general, it is easier to increase twist and more difficult to undertwist. The behavior of roll is slightly different. A significant number of sequences are inflexible with respect to increasing roll (low partition coefficients), and this is not the case for decreasing roll. However, the behavior at the high partition coefficient end is similar for both increasing and decreasing roll. Roll and twist are anti-correlated23,25 and in general, increasing twist and decreasing roll are the most flexible deformations. Untwisting is the most difficult deformation for DNA.

Figure 3. Overall flexibility, measured by the sum of all four partition coefficients, $Q_T$, for all tetramers (- -), hexamers (k) and octamers (—) with relative frequency plotted on a log scale.

Figure 4. Partition coefficients for decrease and increase in twist and roll for all octamers with frequency plotted on a log scale. $Q^{-}_{\mathrm{Roll}}$ (—), $Q^{+}_{\mathrm{Roll}}$ (k), $Q^{-}_{\mathrm{Twist}}$ (- -), $Q^{+}_{\mathrm{Twist}}$ (-·-·)

Bending

The mean RMSD from a straight path is 0.66 Å for all octamers and the standard deviation is 0.35 Å. More than 90% of the octamers have RMSD < 1 Å, so it is clear that the "normal" path is straight B-form DNA. Both the slide frequency and the flexibility frequency plots show that, although individual dinucleotide steps are very different, the coupling of the backbone forces almost all octamers into a B-DNA (slide ≈ 0) conformation, in accordance with experimental observation.25 The effects of sequence context, which are very important at the dimer–tetramer level, are damped as sequence length increases.26,27

An interesting question is whether "bendy" octamers are bent. The two are not necessarily the same,24 and Figure 5(a) confirms that there is no relationship between the flexibility of an octamer as measured by the total partition coefficient and its bentness as measured by the RMSD from a straight path. However, we note an interesting correlation between bentness and the stability of an octamer (Figure 5(b)). For the bent octamers, the stability decreases as the RMSD increases, i.e. DNA prefers to be straight, and there is an energy penalty associated with bending. There are two families of structures. B-form structures are straight, whilst the bent structures are generally A-form, and in this family bending and energy are correlated.

Step behavior

Here, we analyze the behavior typical of each step type, focusing on the octamers.

Stability

Table 1A lists the average potential energy per step for all octamers classified according to the identity of the central step, relative to the energy of the most stable step, AA. Our energy calculation does not include a term for the internal energy of a base-pair, i.e. H-bonding interactions. The calculated stability decreases in the order AA > AT > TA and GC > CG ≫ GG, but the mixed sequence steps AG, GA, CA and AC have very similar average stacking energies. The AA step is the most stable with respect to stacking interactions, because it occupies a central conformation which is most compatible with B-DNA. GG is an outlier and is particularly unstable with respect to stacking energy because GG prefers an A-form, low slide conformation. The conformational preferences of GG are in conflict with those required for the formation of B-DNA, whilst in A-DNA, the preferences of neighboring steps are unsatisfied. The outcome of this struggle is that GG sequences cannot adopt a structure in which all stacking requirements are optimised and are thus less stable. In general, A-DNA structures have higher energy than B-DNA.

Table 1A also lists the base step energies, modified by the addition of −4 kJ mol⁻¹ per hydrogen bond for each GC base-pair.28 The inclusion of this simplistic representation of hydrogen bonds makes GC the most stable step. There is some correlation between the trends in Table 1A and the results of DNA melting experiments that have been used to deconvolute the stability of individual base steps.29 However, the experimental range of ΔG values is 20 kJ mol⁻¹ with a range of up to 17 kJ mol⁻¹ for a given step type in different experimental value sets. Even the trends in different sets are different. It is likely that significant residual intra-strand stacking interactions are present in the melted single-stranded structure, and a detailed comparison of these data with our stacking energies is therefore not sensible.

Figure 5. (a) Relationship between octamer flexibility measured as the sum of all partition coefficients, $Q_T$, and bentness measured by RMSD from a straight path. (b) Relationship between octamer stability measured as the stacking energy, and bentness measured by RMSD from a straight path.

Step parameters

The mean values for all six step parameters of the central step for all octamers were calculated for each step type. Experimental values have been determined from databases of DNA crystal structures.23,25 Tilt, rise and shift show very little variation with sequence with average values of 0°, 3.4 Å and 0.0 Å, respectively. The mean values for twist, roll and slide are given in Table 1A, and the calculated values generally compare well with the experimental data. One cause of disparity between the calculated and experimental data is that the crystal structure data are limited to a small subset of all possible sequences. For example, in the case of GG, A-DNA structures dominate the experimental databases, and if we separate the calculated GG octamer structures into A and B-families, the average calculated parameters for the A-family agree well with the experimental data. Similarly, Gorin et al. found that CA steps populate two distinct conformational families.25 Their larger group of 15 structures has low twist and high roll (similar to our calculated average values), while the second group of eight structures has extremely high twist and negative roll, which may be caused by crystal packing interactions in these structures.21 El Hassan & Calladine did not distinguish the two groups, and so we cannot compare their results for this step.

Flexibility parameters

Our approach to analyzing flexibility is somewhat different from that of most previous workers. In previous analyses, the mean roll (or twist) for each step type was measured from a limited number of crystal structures.23,25 Steps with a high standard deviation, i.e. those that vary significantly from one structure to another, were considered flexible, but there are two types of behavior that can lead to this observation: the step adopts two or more well-defined conformations, but with very different step parameters, or the step has a very shallow single energy minimum. Our flexibility parameters are based on all accessible conformational states of the oligomer and therefore represent a composite parameter where either scenario could lead to a low force constant. This allows bistable sequences with more than one energy minimum to be compared with sequences with a single energy minimum in a meaningful way.

Table 1B shows the mean and standard deviation for both left and right force constants for both twist and roll for all steps. In general, all steps prefer to over- rather than undertwist as discussed above, and on average, the CA and CG steps are the most difficult to undertwist. However, the standard deviation of the force constants is large for these steps, which means that there are some more flexible sequences that do not follow this general trend. The other step with a high standard deviation for undertwisting is TA, which indicates that there are some especially flexible TA sequences (see below).

The picture is slightly different for roll at the central step. The GC, AC and CA steps are on average the least able to increase roll, but again the standard deviations are large, so this is not necessarily a feature of all sequences of this class. TA has an unusually low average force constant for increasing roll and a very low standard deviation, so this step appears to be unique in its ability readily to increase roll, in agreement with previous studies.30,31 Table 1B shows that the RR steps, AG, GA and GG, are particularly inflexible with respect to decreasing roll.

We have also looked at the 1% most flexible and least flexible octamers with respect to increasing and decreasing twist and roll. The results are given in Table 1C. The TA step stands out as the most flexible with respect to decreasing twist and increasing roll, which is the distortion that mixed sequence DNA most strongly resists. This is clearly related to the role of this step in the origin of replication.32 Interestingly, this step also features as one of the least flexible with respect to undertwisting in Table 1C, which indicates that flexibility properties are strongly context-dependent. The CA step is the most flexible step with respect to decreasing roll, in agreement with the analysis of El Hassan & Calladine.23 The AC step stands out as the most flexible with respect to overtwisting, but there are no experimental reports of this phenomenon. It has been suggested that AA steps are relatively stiff.33 However, AA octamers do not feature in Table 1C. They are not particularly inflexible, a view supported by the finding of Olson that the flexibility of AA steps is "comparable to that of other steps"31 and also by studies of nucleosome wrapping.34 The GG step stands out as the least flexible step with respect to reducing roll.

Table 1. Analysis of parameters by step type

A Mean step parameters

Step   Energy (kJ mol⁻¹)     Twist (deg.)              Roll (deg.)               Slide (Å)
       This work  +H bond    This work  ElH  Gorin     This work  ElH  Gorin     This work  ElH   Gorin
AA     0          0          37         36   36        3          1    1         −0.2       −0.2  −0.1
AC     7          3          34         33   36        −1         3    0         −0.4       −0.9  −0.6
AG     6          2          36         33   31        0          6    3         −0.1       −0.1  −0.2
AT     2          2          38         32   33        −6         −2   −1        −0.5       −0.4  −0.6
CA     6          2          31         37   31        9          2    5         −0.3       1.2   0.8
CG     8          0          34         35   31        7          3    7         −0.2       0.0   0.4
GA     6          2          38         38   39        5          4    0         −0.3       −0.4  0.1
GC     7          −1         36         37   38        2          −2   −7        −0.6       0.3   −0.4
GGb    12         4          35         32   33        2          6    7         −0.2       0.7   −0.2
GGa    23         15         32         32   –         5          6    –         −1.6       −1.8  –
TA     3          3          34         31   40        9          12   3         −0.3       −0.8  0.0

B Mean flexibility parameters by step

Step   Twist                           Roll
       Decrease        Increase        Decrease        Increase
AA     0.4 (0.1)       0.4 (0.1)       0.4 (0.1)       0.7 (0.2)
AC     0.5 (0.1)       0.2 (0.0)       0.3 (0.1)       1.1 (0.4)
AG     0.4 (0.1)       0.3 (0.1)       0.7 (0.2)       0.6 (0.2)
AT     0.4 (0.1)       0.3 (0.1)       0.4 (0.1)       0.4 (0.1)
CA     1.0 (0.6)       0.2 (0.1)       0.3 (0.1)       0.9 (1.3)
CG     1.2 (0.6)       0.5 (0.1)       0.4 (0.1)       0.7 (0.5)
GA     0.4 (0.1)       0.5 (0.1)       0.8 (0.1)       0.6 (0.1)
GC     0.7 (0.2)       0.2 (0.1)       0.2 (0.1)       2.1 (0.9)
GG     0.6 (0.4)       0.3 (0.1)       0.8 (0.3)       0.5 (0.2)
TA     0.6 (0.6)       0.3 (0.1)       0.3 (0.1)       0.2 (0.1)

C Extreme flexibility parameters by step

Columns: Most flexible (Twist decrease, Twist increase, Roll decrease, Roll increase); Least flexible (Twist decrease, Twist increase, Roll decrease, Roll increase)
Rows: AA AC AG AT CA CG GA GC GG TA
Values: 58 35; 30 37; 15 73 27; 27 13 58; 10 10; 77 100; 26; 88; 33

A, Comparison of mean step parameters. B, Mean twist and roll flexibilities per step. C, 1% most and least flexible octamers with respect to change in twist and roll. A, The column headed "This work" contains values for twist (deg.), roll (deg.) and slide (Å) at the minimum energy conformation of each octamer, averaged per step type. The columns headed "ElH" are the mean values taken from a database of 400 crystal structures.23 The columns headed "Gorin" were calculated from 38 B-DNA crystal structures.25 GGb are the B-form and GGa the A-form GG octamers. Energy (kJ mol⁻¹) is the mean energy for all octamers of each step type, normalised with respect to the mean energy for all AA octamers; +H bond includes −4 kJ mol⁻¹ per GC base-pair. B, For each step type the value given is the mean flexibility force constant k, averaged over all octamers of that step type, low meaning flexible. The figure in brackets is the standard deviation. The highlighted values are discussed in the text. C, The percentage of octamers per step type which are most or least flexible for either decrease or increase in twist or roll (values less than 10% not shown). For example, of those 1% most flexible octamers with respect to increasing roll, 88% of these are TA octamers.
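The kind of statistic reported in Table 1C (the share of each step type among the 1% most or least flexible octamers) could be computed along the following lines. This is a sketch under stated assumptions, not the authors' procedure: `force_constants` is a hypothetical mapping from octamer sequence to a force constant k for one deformation, ties are broken by sort order, and the central step is named using the ten step labels from Table 1, taking the reverse complement when the step as written is not one of them (for example CC is reported as GG).

```python
from collections import Counter

STEP_TYPES = ("AA", "AC", "AG", "AT", "CA", "CG", "GA", "GC", "GG", "TA")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def central_step(octamer):
    """Name of the central dinucleotide step (positions 4-5) using the ten labels above."""
    step = octamer[3:5]
    if step in STEP_TYPES:
        return step
    return step.translate(COMPLEMENT)[::-1]  # same step read on the other strand


def step_share_of_extremes(force_constants, fraction=0.01, most_flexible=True):
    """Percentage of each step type among the most (or least) flexible octamers.

    Low k means flexible, so the most flexible octamers are those with the smallest k.
    """
    ranked = sorted(force_constants, key=force_constants.get, reverse=not most_flexible)
    n_keep = max(1, round(fraction * len(ranked)))
    tally = Counter(central_step(seq) for seq in ranked[:n_keep])
    return {step: 100.0 * tally[step] / n_keep for step in STEP_TYPES if tally[step]}


# Toy data: four octamers with made-up force constants
toy_k = {"AAATAAAA": 0.1, "CCCTAGGG": 0.2, "GGGCCGGG": 2.0, "AAAGTAAA": 1.0}
most = step_share_of_extremes(toy_k, fraction=0.5)
least = step_share_of_extremes(toy_k, fraction=0.5, most_flexible=False)
```

With the full database, `fraction=0.01` would select 329 of the 32,896 octamers for each deformation.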

Context-dependence

We have previously classified dinucleotides according to their context-dependence, based on the differences between their dinucleotide conformational map and the corresponding maps of each of the tetranucleotides of which they form the central step, as exemplified by the difference in slide at the minimum energy conformation.19 AA, AT, TA were classified as independent, AC, AG, CA, GA weakly dependent and CG, GC, GG dependent. We have now analyzed the change in slide at the energy minimum as the sequence length increases from dimer to tetramer to hexamer to octamer. Table 2 shows the percentage of sequences whose slide value at energy minimum conformation differs by more than 1 Å when placed in a longer sequence context. The classification above holds reasonably well, but considering their slide-changes in going from hexamer to octamer, it is clear that TA steps should be classified as weakly dependent and AG steps as dependent.

Table 2 also demonstrates that context effects are proportionately less important for octamers compared with hexamers and tetramers. Only 2% of octamers have different slide values from that of their central hexamer, and these context-dependent sequences involve only 168 hexamers. A comparison between the percentages of context-dependent hexamers and octamers reveals that there are large falls for the two steps which are most context-dependent at the hexamer level, GG and GC, down from 30% to 13% and 10% to 5%, respectively. No steps show any significant increase, demonstrating that sequence context effects are damped as sequence length increases, i.e. the structure of the central part of an oligomer becomes essentially independent of the wider sequence context.

Bistability is closely related to context-dependence. Table 2 shows the percentage of bistable octamers for each step type. The behavior of the step types parallels that found for context-dependence, because for the octamers, the main factor involved is the switching of the overall structure of bistable GG-rich sequences between A and B-DNA.

Table 2. Re-classification of steps

Step   Tet%   Hex%   Oct%   Bi%
Independent central step
AA     0      0      0      0
AT     0      0      0      0
Weakly dependent central step
AC     0      0      0.5    0.3
TA     50     0.7    0.6    1.4
GA     0      0.8    0.6    0.8
Dependent central step
CA     100    0.4    1.5    2.3
CG     100    2.9    2.7    4.1
AG     0      3.5    3.4    5.4
GC     60     10     5.0    6.0
GG     100    30     13     23

For each step type: Tet% is the percentage of tetramers whose central-step slide value is more than 1 Å different from that of the central dimer in isolation; Hex% is the percentage of hexamers whose central-step slide value is more than 1 Å different from that of the central tetramer; Oct% is the percentage of octamers whose central-step slide value is more than 1 Å different from that of the central hexamer; Bi% is the percentage of bistable octamers, i.e. those which possess two or more distinct energy minima.

Bentness

The RMSD from a straight path was used to compare the intrinsic curvature of the ten step types (Figure 6). The two extreme types of behavior are AA, which is straighter than all other sequences, and GG, which tends to be more bent than any other sequence. This reflects the conformational preferences for B and A-type structures, respectively.

Figure 6. Distribution of octamer bentness, measured by RMSD, by step type. For each central step type, the log frequency of occurrence of RMSD values (in 0.5 Å bins) is plotted.

Special octamers

We anticipate that our database of calculated structural parameters will be used in the identification of sequences with special features, for example particularly straight sequences or particularly rigid sequences. Here, we use the database to look in more detail at those octamers that are outliers in various senses. First we analyze which octamers have an A-DNA structure, slide < −1 Å for the central step. Then we consider stability, bentness, flexibility and bistability.

A-DNA

Nine hundred and twelve octamers have slide < −1 Å in their central step and are classified as A-DNA. The general forms of these octamers are described in Table 3 using the IUPAC nomenclature for nucleic acid substitutions to define consensus sequences.35 All have at least one sequence SSS (S = C or G), and almost all have a longer such sequence, at least SSSS. Of the A-form octamers 50% contain a central GG step, and 10% of all GG octamers have A-DNA as the minimum energy structure, which is too many to give a simple consensus sequence. Due to the bistability of GG octamers, there is a large number of GG sequences that are classified as B-form but also have a local energy minimum A-form structure. GC octamers are also highly represented in the family of A-form sequences, and it is clear from the consensus sequences that GGG, CCC, GGC and GCC are the most important sequence elements that give rise to A-form structures. These results are entirely in accord with the long-held view that non-alternating G,C-rich sequences tend to favor the formation of A-DNA.36,37 However, CCG and CGG are not highly represented in Table 3. The reason is that GC steps have a strong preference for low slide A-type conformations, with a relatively wide minor groove,34 whereas the CG steps have a strong preference for a high slide B-type conformation19,38 (and see Table 1A).

No AA octamers are represented in Table 3. Moreover, the incidence of AA steps anywhere in the A-form sequences is only 7% and, in two-thirds of these sequences, the AA is a terminal step. Thus it seems that AA steps, particularly within a sequence, militate against the entire sequence adopting an A-DNA conformation. Similarly, AT strongly disfavors A-DNA, and there are only four examples of A-form sequences with AT as the central step. In these cases, the conformation is clearly dominated by the flanking SSS sequences. These findings are in good agreement with previous studies of A-philicity and A-phobicity.22,36,38

Table 3. The A-form octamers

Step type   % of A-form   Consensus sequence
AA          0.0           –
AC          1.1           NNBACCCC
AG          1.9           NNNAGSSS
AT          0.2           SSSATCCC
CA          2.4           SSSCANNN
CG          3.5           SSSCGNNN
GA          0.5           GGGGANNN
GC          2.7           NSSGCSNN
GG          10.6          GGC or GGG
TA          2.3           SSSTANNN

For each central step type column 2 gives the percentage of octamers of that step type with slide < −1.0 Å. Column 3 contains the general form of the A-form octamers where B = not A, S = G or C, and N = any base.

Stability

Table 4 shows the 20 most and least stable octamers, both with respect to their stacking energy alone and with the inclusion of −4 kJ mol⁻¹ per GC hydrogen bond, normalized so that AAAAAAAA has zero energy. With respect to their stacking interactions, runs of A constitute the most stable structures and runs of G the least stable. Half of the least stable octamers are also A-form and all are bistable, meaning that they possess a second energy minimum which is A-form. When hydrogen bonding energy is also considered, alternating GC runs form the most stable structures, whilst runs of Gs remain the least stable. Again these least stable octamers are either A-form or bistable.

Very bent and very straight DNA

We have shown that the "normal" path for an octamer is relatively straight. Analysis of the 20 most bent octamers (Table 4) shows that all have

Table 4. Extreme octamers

Relative energy (kJ mol⁻¹)                RMSD (Å)          Twist (deg.)                        Roll (deg.)
Stacking          Stacking + Hbond                          Decrease          Increase          Decrease          Increase

AAAAAAAT   −3.6   GCGCGCGC  −17.5         CAAGGGCG   0.16   TGGGACCC*  0.13   CCCACGGG   0.08   CCCCACTT*  0.06   AGGTAGCC   0.05
AAAAAATT   −3.4   ACGCGCGC  −12.9         CGGGAAGG   0.16   CACTAGCC*  0.13   CCCACCAG   0.08   ACCTACCC*  0.06   CTATAGCC   0.06
AAAAATTT   −3.4   CGCGCGCG  −12.5         CGAGGGAA   0.15   AGGGACCC*  0.13   CCCACTGG   0.08   CCCCACCA*  0.07   AAGTAGCC   0.06
AAAATTTT   −3.4   AAGCGCGC  −11.5         AGGGGAAG*  0.15   AACTAGCC   0.13   GCCACCCA*  0.08   CCCCACTC*  0.07   GGCTACTA   0.06
AAAAAAAA   −1.4   GCGCGCAC  −11.5         AAGGCCTT   0.15   CGCTAGCC   0.14   GGGACCCA*  0.08   GAGTAGCC*  0.08   TGGGACCC*  0.06
AAAAAAAC   −0.3   GCGCGAGC  −11.4         AAAAGGGG   0.15   ATCTAGCC   0.14   CCCACCCA*  0.09   GTGCAGGC   0.08   CGGTAGCC   0.07
AAAAAAAG    0.0   GCGAGCGC  −11.2         CCCATGGG   0.15   AGGTAGCC   0.14   CCCACAGA   0.09   TGGCAGGC*  0.08   CTGTAGCC   0.07
AAAAATAT    0.8   GCACGCGC  −11.2         CCCGGGAA*  0.15   GGGGAGCG   0.15   CCCACACA   0.09   GGCTACCA*  0.08   GCCTATCA   0.07
ATAAAAAT    0.9   GCGCACGC  −10.9         CAGGGGCG*  0.15   TCGGGGGT*  0.15   CCCACATC   0.09   ACGTAGCC*  0.09   GGCTAGTA   0.07
AAAAAATC    1.0   GAGCGCGC  −10.9         AGGAGGGG*  0.15   GCTGGGGT*  0.15   CCTACCCC*  0.09   GGCCAGTG*  0.09   GCGGACCC   0.08
ATTTAAAT    1.0   AGCGCGCG  −10.9         CAAAGGGG   0.14   CTGTAGCC   0.15   GCCAGGGG*  0.09   GCCCAGCA*  0.09   GGGAGCCA*  0.08
ATTAAAAT    1.0   GCGCGCAT  −0.15 [sic]   CAAGGGAA   0.14   GCGGAGCC   0.15   CCCACCCG   0.09   GGCCAGTA*  0.09   GGGGAGCT   0.08
GAAAAAAT    1.1   GCGCAAGC   −9.9         CGAAGGGG   0.14   GACTAGGC*  0.15   CCCACAAT   0.09   CAATAGCC*  0.09   CTTTACCC   0.08
AAAAAAGT    1.1   GCAAGCGC   −9.7         CCCGAGGG*  0.14   GACTAGGC*  0.15   CCCACTAG   0.09   CGATAGCC*  0.09   GCCTATAA   0.08
ATAAATTT    1.1   GCGCGCAG   −9.7         CCCAAGGG*  0.14   GGCTACTA   0.15   CACGGGGG*  0.09   GCGCAGGC*  0.09   GCCTATGA   0.08
AAAATTAT    1.1   CAGCGCGC   −9.4         CCCCAAGG   0.13   ATGGAGCC   0.15   CCCACCGC   0.09   AGATAGCC*  0.09   GCCTATGA   0.08
ATAAAATT    1.1   GCGCATGC   −9.4         CTTGGGGG*  0.13   GAGGACCC   0.15   CCCACGGA   0.09   GCCCAGCT*  0.09   ATGTAGCA   0.08
AAAATATT    1.2   AAAAGCGC   −9.1         CAAGGAGG   0.13   GCCTAAGC   0.16   CCCACCTA   0.10   GCCTACGC*  0.09   GGGGAGCG   0.08
AAATATTT    1.2   AAACGCGC   −9.1         AGGGGGAA*  0.12   ACCGGGGT*  0.16   CCTACGGG   0.10   GGCTATGA*  0.09   CACTAGCC*  0.09
AAATAAAT    1.3   AGCGCGCT   −9.1         AAAGGGGG*  0.12   GCCTAAAC   0.16   CCCACACC   0.10   AAATAGCC*  0.09   GGGGAGTC   0.09
...
CTGGGGGG*  51.5   TTGGGGTA*  26.2         GGGGAGGC*  3.01   CCCCGCAT*  3.85   TGACGCAT   0.73   GCAGGACC   1.43   ATGCACCC*  11.3
CAGGGGGG*  51.6   TATGGGTA   26.2         GGGAGGGG*  3.01   AACCGGCC*  3.85   TAACGCAT   0.74   AGAGGTCC   1.44   CCCCACTC*  11.4
CCCCAGGG*  51.7   TGGGGATA*  26.3         GGGGAGGG*  3.01   GGCCGCAT*  3.85   GCACGCGT   0.74   TAAGGACT   1.44   GGCCACTC*  11.4
CCTGGGGG*  51.7   TGGGGGTG*  26.3         GGGGACCC*  2.99   GCCCGCAT*  3.87   AAACGCAC   0.74   ATTGGAAC   1.44   GGCCAGTG*  11.6
CCAGGGGG*  51.8   AGGGGGGA*  26.5         GGGAGGGC*  2.98   ACGCGGCC*  3.87   AATCGCAC   0.74   GTTGGAAT   1.44   CCCCACAC*  11.8
CCCAGGGG*  51.8   TGGGGTGG*  26.8         GGGGGGGG*  2.98   GCGCGGCC*  3.91   CATCGCAT   0.75   TGAGGACC   1.44   GTGCAGGC*  11.8
GGGGGGGA*  52.3   TGGGGGAA*  26.8         CCCGGGGG*  2.96   GGGTAAAA*  3.94   AGGCGCAC   0.75   TAAGGACA   1.44   GGCCACAC*  11.8
GGGGGGGG*  52.3   TCCACCCA   26.8         GGGGAGCC*  2.95   GGGCGCAC*  3.97   TCGCGCAC   0.75   ATTTGGAAT  1.45   CCCCACAT*  11.9
TCGGGGGG*  52.6   TCCCACCA   26.9         AGGGGGGG*  2.95   GGGCGCAT*  3.98   GAACGCAC   0.75   GTTGGATT   1.45   GTGCAGCC*  11.9
CGGGGGGA*  53.0   CCCACCCA*  27.1         GGGGGGGC*  2.95   CCCCGCAC*  4.00   AAACGCAT   0.76   ATAGGACC   1.45   GGGCATGG*  11.9
CCGGGGGA*  53.0   CCCCACCA*  27.1         CCCGGGGC*  2.94   GGCCGCAC*  4.01   AATCGCAT   0.76   ATTGGATC   1.45   GGCCACAT*  12.0
CCCGGGGA*  53.1   TAGGGGGG*  27.2         ACCGGGGG*  2.94   GCCCGCAC*  4.01   GGACGCAC   0.76   ATTGGATT   1.45   GGCCACGT*  12.0
CCCCGGGA*  53.1   TGTGGGGA*  27.3         GCCGGGGG*  2.94   GCCCGCAA*  4.02   GAACGCAT   0.76   GAAGGACT   1.46   ATGCAGCC*  12.0
TCCGGGGG*  53.3   TTGGGGGA*  27.4         GGGGGGGT*  2.94   GGCCGGAT*  4.03   AGGCGCAT   0.76   GTTGGATC   1.46   GGGCATAC*  12.1
CGGGGGGG*  54.8   TGGGGGGG*  27.4         GCCGGGGG*  2.92   GGCCGCAA*  4.03   TCGCGCAT   0.76   AAAGGACT   1.46   GGGCATAT*  12.5
TGGGGGGA*  55.4   TATGGGGA*  27.9         GGGAGGCC*  2.92   GCCCGGAT*  4.04   AAGCGCAT   0.77   GAAGGACA   1.46   GGCCACAG*  12.8
TGGGGGGG*  55.4   TAGGGGTA*  28.1         AGGGGGGC*  2.92   GAGCGGCC*  4.08   AAGCGCAT   0.77   TAAGGACC   1.46   GGCCACGA*  12.8
CCGGGGGG*  55.7   TAGGGGGA*  29.1         GCCGGGGT*  2.91   GGCCGGAC*  4.12   GAGCGCAT   0.77   AAAGGACC   1.46   CCCCACTT*  13.1
CCCGGGGG*  56.0   TGGGGGTA*  30.1         AGGGGGGT*  2.91   TCGCGGCC*  4.13   GGACGCAT   0.77   GAAGGACC   1.48   GGCCACTT*  13.1
CCCCGGGG*  56.2   TGGGGGGA*  31.4         ACCGGGGC*  2.91   GGGCGCAA*  4.16   GAGCGCAC   0.78   AAAGGACC   1.48   GGCCATAA*  13.3

The top and bottom 20 most and least stable, bent and flexible octamers. Energy is relative to that of AAAAAAAA, both with and without the inclusion of hydrogen bonds. The twist and roll values given are those of the force flexibility constant k for each of the flexibilities (decrease twist, increase twist, decrease roll, increase roll). Bistable or A-form octamers are marked with an asterisk.
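The relationship between the two energy columns of Table 4 can be illustrated with a short calculation. This is a minimal sketch under one stated assumption: that the −4 kJ mol⁻¹ hydrogen-bond term is applied once per G·C base-pair (the one extra hydrogen bond relative to A·T). That reading is an interpretation, not something spelled out in the text, but it reproduces the tabulated values for TGGGGGGA (55.4 to 31.4) and TGGGGGGG (55.4 to 27.4). The function names are hypothetical.

```python
# Hydrogen-bond term quoted in the text, in kJ/mol.
# ASSUMPTION: applied once per G.C base-pair of the octamer.
GC_HBOND_KJ_PER_MOL = -4.0


def gc_count(seq: str) -> int:
    """Number of positions in the sequence occupied by G or C."""
    return sum(base in "GC" for base in seq)


def energy_with_hbond(seq: str, stacking_energy: float) -> float:
    """Stacking energy plus the assumed per-G.C hydrogen-bond term.

    Both input and output are relative to AAAAAAAA, which has no G.C pairs
    and is therefore unchanged by the correction.
    """
    return stacking_energy + GC_HBOND_KJ_PER_MOL * gc_count(seq)


# Stacking energies taken from Table 4.
for seq, stacking in [("TGGGGGGA", 55.4), ("TGGGGGGG", 55.4)]:
    print(seq, round(energy_with_hbond(seq, stacking), 1))
```

Because the correction scales with G·C content, G-rich runs gain the most from hydrogen bonding, yet their stacking penalty is large enough that they remain the least stable octamers in both columns.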

RMSD > 2.8 Å and that sequences of G and C bases predominate, with only one octamer having fewer than seven G or C bases. As we would expect, there is a correlation between slide at the central step and straightness of the octamer. An A-DNA octamer is bent by definition at its central step. Table 4 also contains the least bent octamers. All have RMSD ≤ 0.16 Å. All but three are again RR octamers, with about half of these being GG octamers, but this time almost all contain a (non-central) AA step. The fact that some GG octamers are the most bent, whilst others are the least bent, again emphasizes the particular context-dependence of GG steps.

Very flexible and very rigid DNA

Table 4 also lists the extreme outliers in terms of flexibility with respect to increasing and decreasing roll and twist. We have already looked at the 1% most and least flexible octamers (Table 1C). It is difficult to draw any general conclusions, but there are some sequences that clearly have special properties. For example, CCCAC appears to be associated with extreme flexibility with respect to increasing twist. There are several cases where two very similar sequences have opposite properties. For example, if we compare AACTAGCC with AATTACCC, which has six identical bases in the same positions, the first octamer is highly flexible with respect to increasing twist, whereas the second is extremely inflexible in the same direction.
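The similarity between the two octamers compared above can be checked position by position. The sketch below is purely illustrative (the helper names are hypothetical); it counts matching positions and reports where the sequences differ.

```python
def shared_positions(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences carry the same base."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b))


def differing_positions(a: str, b: str) -> list[int]:
    """1-based positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]


print(shared_positions("AACTAGCC", "AATTACCC"))     # 6
print(differing_positions("AACTAGCC", "AATTACCC"))  # [3, 6]
```

The two differences fall at positions 3 and 6, immediately flanking the central TA step, so the contrast in flexibility arises from the nearest neighbours of that step.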


this database clearly shows that most sequences adopt a B-DNA conformation in order to accommodate the preferences of the majority of steps. Thus major conformational variations are diminished as the length of the oligomer increases, and the structures become more uniform and less flexible. We have also shown that runs of non-alternating G/C sequences predominate in A-DNA structures. As GG and GC steps have a preference for low slide, GGC and GGG sequences promote the formation of A-DNA, whereas CG-containing sequences do not. There are close connections between sequences which are bistable at the central step, those which form A-DNA, those which are context-dependent and those which are bent. AA stands out as the one step that is never found in an A-DNA conformation. TA stands out as the single most flexible sequence element, particularly with respect to untwisting. These conclusions and many of the observations above relating to the stability, geometry and flexibility of different sequences agree well with experimental data from a range of sources and inspire confidence in the validity of the structural database. We are now developing tools to use this database for the analysis of DNA sequence-activity relationships.
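For a sense of the scale of such a database, the number of distinct duplexes of a given length can be counted by enumeration. This is a minimal sketch under the assumption that "unique" means a sequence and its reverse complement describe the same double helix and are counted once; the counting convention is not stated in this section, and the function names are hypothetical.

```python
from itertools import product

# Watson-Crick complement table.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def unique_duplexes(length: int) -> int:
    """Count sequences of the given length, merging reverse complements."""
    canonical = set()
    for bases in product("ACGT", repeat=length):
        seq = "".join(bases)
        # Keep one representative per strand pair: the alphabetically smaller.
        canonical.add(min(seq, reverse_complement(seq)))
    return len(canonical)


print(unique_duplexes(4))  # 136
print(unique_duplexes(8))  # 32896
```

Under this convention there are 32,896 octamers, so the 1570 bistable octamers reported in this paper correspond to roughly 4.8% of the total, in line with the quoted figure of 5%.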

Acknowledgements

We thank the BBSRC for support of this work and the Wolfson Foundation for provision of computing facilities.

Bistable and context-dependent octamers

As described previously, we consider that an octamer is bistable if it possesses two distinct energy minima (with slide values differing by at least 1 Å). By this definition 5% (1570) of octamers are bistable. All these octamers contain at least one sequence of the form GGG or GGC or their complements, GCC or CCC. One-third (525) of the bistable octamers are also A-form, and half of the A-form octamers are bistable. Thus 94% of the octamers have stable well-defined B-form conformations, 1% have stable well-defined A-form conformations, and the remaining 5% are bistable, populating their A and B-energy minima in a 1:2 ratio. These bistable octamers have context-dependent conformational properties, but for all remaining sequences, the conformation of the octamer’s central step is relatively insensitive to the wider sequence context.
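The bistability criterion can be written as a simple classification rule. The sketch below is illustrative only and makes several assumptions that are not taken from the paper: the input is a list of central-step slide values, one per energy minimum, with the global minimum first; the A-form threshold of −1.0 Å is borrowed from the Table 3 caption; and anything that is not A-form is labelled B-form.

```python
from typing import Sequence, Tuple

A_FORM_SLIDE = -1.0        # angstrom; threshold from the Table 3 caption
BISTABLE_SLIDE_GAP = 1.0   # angstrom; minimum separation of two distinct minima


def classify(minima_slides: Sequence[float]) -> Tuple[str, bool]:
    """Return (form of the global minimum, whether the octamer is bistable).

    minima_slides: slide at each energy minimum, lowest-energy minimum first
    (a hypothetical input format chosen for this example).
    """
    if not minima_slides:
        raise ValueError("at least one energy minimum is required")
    global_slide = minima_slides[0]
    form = "A" if global_slide < A_FORM_SLIDE else "B"
    # Bistable if any other minimum lies at least 1 angstrom away in slide.
    bistable = any(
        abs(slide - global_slide) >= BISTABLE_SLIDE_GAP
        for slide in minima_slides[1:]
    )
    return form, bistable


print(classify([0.4]))         # single B-type minimum
print(classify([0.4, -1.6]))   # B-form global minimum plus an A-type minimum
print(classify([-1.6, 0.4]))   # A-form global minimum plus a B-type minimum
```

The quoted 1:2 ratio is consistent with the counts given in the text: 525 of the 1570 bistable octamers are A-form, leaving 1045 that are not, and 525:1045 is close to 1:2.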

Conclusions

We have constructed a database of structural parameters for all DNA octamers. This database includes the values of twist, roll, tilt, rise, slide and shift for each step of each octamer, together with energy, flexibility parameters and RMSD from straight. Analysis of

References

1. Travers, A. A. (1989). DNA conformation and protein-binding. Annu. Rev. Biochem. 58, 427–452.
2. Davey, C. A., Sargent, D. F., Luger, K., Maeder, A. W. & Richmond, T. J. (2002). Solvent mediated interactions in the structure of the nucleosome core particle at 1.9 Ångstrom resolution. J. Mol. Biol. 319, 1097–1113.
3. Wold, M. S. (1997). Replication protein A: a heterotrimeric, single-stranded DNA-binding protein required for eukaryotic DNA metabolism. Annu. Rev. Biochem. 66, 61–92.
4. Helene, C. & Lancelot, G. (1982). Interactions between functional-groups in protein–nucleic acid associations. Prog. Biophys. Mol. Biol. 39, 1–68.
5. Kosikov, K. M., Gorin, A. A., Lu, X. J., Olson, W. K. & Manning, G. S. (2002). Bending of DNA by asymmetric charge neutralization: all-atom energy simulations. J. Am. Chem. Soc. 124, 4838–4847.
6. Sprous, D., Young, M. A. & Beveridge, D. L. (1999). Molecular dynamics studies of axis bending in d(G(5)-(GA(4)T(4)C)(2)-C-5) and d(G(5)-(GT(4)A(4)C)(2)-C-5): effects of sequence polarity on DNA curvature. J. Mol. Biol. 285, 1623–1632.
7. Tsui, V. & Case, D. A. (2000). Molecular dynamics simulations of nucleic acids with a generalized Born solvation model. J. Am. Chem. Soc. 122, 2489–2498.
8. Calladine, C. R. (1982). Mechanics of sequence-dependent stacking of bases in B-DNA. J. Mol. Biol. 161, 343–352.
9. Calladine, C. R. & Drew, H. R. (1984). A base-centered explanation of the B-to-A transition in DNA. J. Mol. Biol. 178, 773–781.
10. Hunter, C. A. (1993). Sequence-dependent DNA structure: the role of base stacking interactions. J. Mol. Biol. 230, 1025–1054.
11. Bolshoy, A., McNamara, P., Harrington, R. E. & Trifonov, E. N. (1990). Experimental and computational evaluation of 16 DNA wedge angles. Biophys. J. 57, A454–A454.
12. Vlahovicek, K., Munteanu, M. G. & Pongor, S. (1999). Sequence-dependent modeling of local DNA bending phenomena: curvature prediction and vibrational analysis. Genetica, 106, 63–73.
13. Brukner, I., Sanchez, R., Suck, D. & Pongor, S. (1995). Trinucleotide models for DNA bending propensity: comparison of models based on DNAseI digestion and nucleosome packaging data. J. Biomol. Struct. Dyn. 13, 309–317.
14. Crothers, D. M. (1998). DNA curvature and deformation in protein–DNA complexes: a step in the right direction. Proc. Natl Acad. Sci. USA, 95, 15163–15165.
15. Hunter, C. A. & Lu, X. J. (1997). Construction of double-helical DNA structures based on dinucleotide building blocks. J. Biomol. Struct. Dyn. 14, 747–756.
16. Hunter, C. A. & Lu, X. J. (1997). DNA base-stacking interactions: a comparison of theoretical calculations with oligonucleotide X-ray crystal structures. J. Mol. Biol. 265, 603–619.
17. Packer, M. J. & Hunter, C. A. (1998). Sequence-dependent DNA structure: the role of the sugar-phosphate backbone. J. Mol. Biol. 280, 407–420.
18. Packer, M. J., Dauncey, M. P. & Hunter, C. A. (2000). Sequence-dependent DNA structure: dinucleotide conformational maps. J. Mol. Biol. 295, 71–83.
19. Packer, M. J., Dauncey, M. P. & Hunter, C. A. (2000). Sequence-dependent DNA structure: tetranucleotide conformational maps. J. Mol. Biol. 295, 85–103.
20. Diekmann, S. (1989). Definitions and nomenclature of nucleic acid structure parameters. J. Mol. Biol. 205, 787–791.
21. Packer, M. J. & Hunter, C. A. (2001). Sequence–structure relationships in DNA oligomers: a computational approach. J. Am. Chem. Soc. 123, 7399–7406.
22. Lu, X. J., Shakked, Z. & Olson, W. K. (2000). A-form conformational motifs in ligand-bound DNA structures. J. Mol. Biol. 300, 819–840.
23. ElHassan, M. A. & Calladine, C. R. (1997). Conformational characteristics of DNA: empirical classifications and a hypothesis for the conformational behavior of dinucleotide steps. Phil. Trans. ser. A, 355, 43–100.
24. Widom, J. (2001). Role of DNA sequence in nucleosome stability and dynamics. Quart. Rev. Biophys. 34, 269–324.
25. Gorin, A. A., Zhurkin, V. B. & Olson, W. K. (1995). B-DNA twisting correlates with base-pair morphology. J. Mol. Biol. 247, 34–48.
26. Green, M. M., Peterson, N. C., Sato, T., Teramoto, A., Cook, R. & Lifson, S. (1995). A helical polymer with a cooperative response to chiral information. Science, 268, 1860–1866.
27. Yanagi, K., Prive, G. G. & Dickerson, R. E. (1991). Analysis of local helix geometry in three B-DNA decamers and eight dodecamers. J. Mol. Biol. 217, 201–214.
28. Turner, D. H., Sugimoto, N., Kierzek, R. & Dreiker, S. D. (1987). Free energy increments for hydrogen bonds in nucleic acid base pairs. J. Am. Chem. Soc. 109, 3783–3785.
29. Owczarzy, R., Vallone, P. M., Gallo, F. J., Paner, T. M., Lane, M. J. & Benight, A. S. (1998). Predicting sequence-dependent melting stability of short duplex DNA oligomers. Biopolymers, 44, 217–239.
30. Travers, A. A. & Klug, A. (1987). The bending of DNA in nucleosomes and its wider implications. Phil. Trans. R. Soc. Lond. B, 317, 537–561.
31. Olson, W. K., Gorin, A. A., Lu, X. J., Hock, L. M. & Zhurkin, V. B. (1998). DNA sequence-dependent deformability deduced from protein–DNA crystal complexes. Proc. Natl Acad. Sci. USA, 95, 11163–11168.
32. Watson, J., Hopkins, N., Roberts, J., Weiner, A. & Steitz, J. A. (1998). Molecular Biology of the Gene, 4th edit., Benjamin-Cummings, New York.
33. Simpson, R. T. & Kunzler, P. (1979). Chromatin and core particles formed from the inner histones and synthetic polydeoxyribonucleotides of defined sequence. Nucl. Acids Res. 6, 1387–1415.
34. Satchwell, S. C., Drew, D. R. & Travers, A. A. (1986). Sequence periodicities in chicken nucleosome core DNA. J. Mol. Biol. 191, 659–675.
35. Cornish-Bowden, A. (1985). Nomenclature for incompletely specified bases in nucleic-acid sequences: recommendations 1984. Nucl. Acids Res. 13, 3021–3030.
36. Basham, B., Schroth, G. P. & Ho, P. S. (1995). An A-DNA triplet code: thermodynamic rules for predicting A-DNA and B-DNA. Proc. Natl Acad. Sci. USA, 92, 6464–6468.
37. Peticolas, W. L., Wang, Y. & Thomas, G. A. (1988). Some rules for predicting the base-sequence dependence of DNA conformation. Proc. Natl Acad. Sci. USA, 85, 2579–2583.
38. Tolstorukov, M. Y., Ivanov, V. I., Malenkov, G. G., Jernigan, R. L. & Zhurkin, V. B. (2001). Sequence-dependent B↔A transition in DNA evaluated with dimeric and trimeric scales. Biophys. J. 81, 3409–3421.

Edited by Sir A. Klug (Received 25 March 2003; received in revised form 5 August 2003; accepted 11 August 2003)