J. Mol. Biol (1986) 190: 619-633
Micrococcal Nuclease as a DNA Structural Probe: Its Recognition Sequences, Their Genomic Distribution and Correlation with DNA Structure Determinants
James T. Flick†, Joel C. Eissenberg and Sarah C. R. Elgin
Department of Biology, Washington University, Box 1137, St Louis, MO 63130, U.S.A.
(Received 18 June 1985, and in revised form 10 March 1986)
We have analyzed micrococcal nuclease (MNase) DNA cleavage patterns at the sequence level by examining 2.3 × 10^3 base-pairs of data derived from the Drosophila melanogaster 44D larval cuticle locus. Within this region, MNase preferentially cleaved 140 sites. Clusters of these sites appear to generate the preferential MNase eukaryotic DNA cleavage sites seen on agarose gels at roughly 100 to 300 base-pair intervals. These clusters of preferential cleavage sites rarely occur within gene coding regions. The analysis revealed that duplex DNA sequences preferentially cleaved by MNase are generally determined by a single strand sequence: d(A-T)n, where n > 1, flanked by a 5' dC or dG. Cleavage of the other strand is generally staggered 5' by several nucleotides and occurs even if such sequences are absent on that strand. An empirical predictive DNA cleavage model derived from a statistical analysis of the sequence level data was applied to seven eukaryotic gene loci of known sequence. The predicted patterns were in good general agreement with the previously observed eukaryotic gene/spacer cleavage pattern. Statistical analysis also revealed that sites of predicted preferential DNA cleavage occur less frequently in protein coding regions than for randomized sequences of the same length and nucleotide content. Comparison of the MNase cleavage patterns to the sequence-dependent pattern of binding energies between duplex DNA strands indicates that MNase preferentially cleaves sequences with low helix stability.
† Author to whom all correspondence should be addressed: Department of Laboratory Medicine, Medical University of South Carolina, 171 Ashley Ave, Charleston, SC 29425, U.S.A.
0022-2836/86/160619-15 $03.00/0 © 1986 Academic Press Inc. (London) Ltd.
1. Introduction
One goal in the study of gene regulation is to determine features of DNA structure that are important for the binding of structural or regulatory proteins. Such features may reflect interactions with specific chemical groups of DNA either in a manner dictated primarily by the local structure of single nucleotides, or in a manner dominated by higher-order structural interactions determined by strings of nucleotides. It is now clear that B-form DNA has considerable structural heterogeneity that might be recognized by DNA binding proteins. For example, X-ray crystallographic analysis of a DNA dodecamer has indicated that local helical twist is sequence-dependent (Dickerson & Drew, 1981). Application of a DNA structure model accounting for such phenomena has led to the observation that unfavorable steric interactions between purines on adjacent base-pairs may be a major causative agent (Calladine, 1982). A recent nuclear magnetic resonance study revealed evidence for the validity of this DNA structure model in solution (Patel et al., 1983), offering further support for its potential biological significance. Other work has focused on additional higher-order structural features, such as the sequence dependence of DNA helix axis bending (Wu & Crothers, 1984), or helix axis flexibility (Hogan et al., 1983; Chen et al., 1985). A number of DNA cleavage agents may recognize such B-DNA structural features. DNase I appears to preferentially cleave regions of DNA associated with local helical twist (Lomonossoff et al., 1981). Recent work by Drew (1984) showed that DNase I preferentially cleaves duplex DNA in a manner suggesting that it recognizes a feature of the minor groove. A study of the Escherichia coli Tyr promoter DNA, digested with DNase I, DNase II, and the orthophenanthroline-cuprous complex, indicates that these agents recognize DNA structural features determined by several contiguous nucleotides (Drew & Travers, 1984). Micrococcal nuclease may also recognize certain specific DNA structural features of this type. While
it does not cleave near a unique sequence or at a specific base, a recent agarose gel resolution study of the Drosophila melanogaster heat shock locus 67B1 revealed that MNase† preferentially cleaved regions at roughly 100 to 300 base-pair intervals throughout the intergenic spacers, with little or no cleavage within the five protein-coding regions (Keene & Elgin, 1981). An extension of the analysis to 18 genes of D. melanogaster and two mammalian genes revealed an analogous pattern, with regions of preferential cleavage occurring within long introns as well as in the intergenic spacers (Keene & Elgin, 1984). Other recent MNase digestion studies have revealed the same gene/spacer pattern in the D. melanogaster histone gene locus (Udvardy & Schedl, 1983), and in the 44D larval cuticle protein locus (Eissenberg & Elgin, 1983). Studies with the chemical cleavage agent orthophenanthroline-cuprous complex ((OP)2Cu+) have shown a very similar pattern of cleavage at the agarose gel level of resolution (Cartwright & Elgin, 1982; Jessee et al., 1982; Eissenberg & Elgin, 1983). A closely correlated pattern has also been seen in an analysis of the interaction of topoisomerase II with purified Drosophila DNA (Udvardy et al., 1985). These observations suggested that MNase might recognize DNA structural features of a general nature, rather than a property specific to MNase-DNA interactions. A detailed determination of the MNase recognition sequences might therefore lead to some observations regarding a biologically relevant pattern of DNA structure. A number of studies addressing the question of an MNase recognition sequence have been published (Horz & Altenberger, 1981; Dingwall et al., 1981). These studies revealed the complexity of MNase DNA cleavage, and the need for analysis of a greater amount of sequence-level information.
We present results demonstrating that regions of eukaryotic DNA preferentially cleaved by MNase are determined by single sites or clusters of sites defined by a few specific sequence classes. A predictive DNA cleavage model was developed from statistical analysis of 2.3 kb of sequence-level cleavage data obtained from the D. melanogaster larval cuticle locus 44D. The predictions of the model show good general agreement with MNase cleavage patterns of a given locus analyzed at the agarose gel level of resolution. Further, application of the model to the DNA sequences of seven eukaryotic loci predicts the gene/spacer pattern that has been generally observed (Keene & Elgin, 1984).
† Abbreviations used: MNase, micrococcal nuclease; kb, 10^3 bases or base-pairs. For clarity sequence hyphens are omitted throughout this paper.
2. Materials and Methods
(a) Digestion of DNA and determination of cleavage sites
The DNA for this analysis was obtained from a pUC9 subclone of λDMLCP2 (Snyder et al., 1982), containing a 1.9 kb BamHI-HindIII fragment of the D. melanogaster 44D larval cuticle locus. This region extends from the 3' end of gene II, and includes a pseudogene. This region was chosen because it had already been sequenced (Snyder et al., 1982; Eissenberg & Elgin, 1983), and because it contains 7 preferential MNase cleavage regions and a 500 base-pair segment relatively free of preferential cleavage, as shown by analysis of agarose gel data (Eissenberg & Elgin, 1983). DNA fragments were generated with an appropriate restriction enzyme (New England Biolabs), 3' end-labeled with the Klenow fragment of E. coli DNA polymerase I (Boehringer-Mannheim) and digested with a second restriction enzyme. Isolated DNA fragments with a single end-label were lightly digested with 0.15 unit MNase/ml (Worthington Labs, Millipore Corp.) for 1 min at 25°C in MNase digestion buffer (60 mM-KCl, 15 mM-NaCl, 15 mM-Tris·HCl (pH 7.4), 0.5 mM-dithiothreitol, 1 mM-CaCl2, 0.25 M-sucrose (Wu et al., 1979)). The digestion was performed in the presence of excess pBR322 DNA (0.2 µg/µl) to ensure that all samples were lightly digested to the same extent, so that the resulting cleavage patterns of different DNA fragments could be considered to reflect relative cleavage probabilities. Estimates of single nucleotide cleavage probabilities can be determined when the probability of more than one cleavage per molecule is very low. The choice of the 3' end-label and use of 25°C for digestion was made to minimize potential artifacts due to MNase 3' exonuclease activity (Horz & Altenberger, 1981). After MNase digestion, the DNA was extracted with phenol, heat denatured, and electrophoresed in a denaturing 5% (w/v) acrylamide gel (8 M-urea, 1 × TBE (89 mM-Tris-base, 89 mM-boric acid, 3.8 mM-EDTA) (Maxam & Gilbert, 1980)). Different MNase digestion samples were standardized by loading the same amount of radioactivity into each well.
The reaction products of purine-specific cleavage (G and G + A reactions of Maxam & Gilbert (1980)) performed on the same fragment were run simultaneously on adjacent lanes of the gel to determine the position of MNase cleavage. Gel autoradiographs were made using Kodak XAR-5 X-ray film for 24 h at -20°C without an enhancing screen. (OP)2Cu+ digestion patterns were analyzed in an identical fashion; digestion conditions used were those described by Cartwright & Elgin (1982). MNase digestions of the same DNA fragment were placed in adjacent lanes of the gel so that direct comparisons could be made.
(b) Comparison of sequence level data to agarose gel data
Preferred MNase cleavage regions determined at the agarose gel level of resolution were related to sequence-level data by employing a gel of intermediate resolution. End-labeled digested duplex DNA fragments were run in a 5% to 15% non-denaturing acrylamide gradient gel (Jeppeson, 1974; Barnes, 1979). Position markers were generated by digestion of the parent fragment with Sau3A and TaqI. This method gave an average relative spatial resolution of approximately 10 nucleotides, permitting good alignment of preferentially cleaved agarose gel regions with sequence level data.
(c) Method of scoring single nucleotide cleavage probabilities
The autoradiographs were analyzed with a scanning densitometer (Helena Labs). Relative single nucleotide cleavage probabilities (P1)† were determined by the integration of individual DNA fragment peaks (Lutter, 1978). Under ideal conditions using sequential loadings of successive lanes, a sequence of 600 nucleotides could be examined on a single gel. Data from different gels could be compared because of the standardized method of MNase digestion, gel electrophoresis and autoradiography.
(d) Statistical methods for identifying MNase recognition sequences
A method for identifying the recognition (or "consensus") sequences preferentially cleaved by MNase using the cleavage patterns of protein-free DNA must meet certain requirements. First, the method must accommodate clusters of individual sites, since analysis of MNase DNA cleavage data indicates that the enzyme cleaves between several nucleotides when preferential cleavage occurs (in contrast to restriction enzymes, for example). The relative magnitude of the cleavage probability for a given short sequence is used to identify potential recognition sequences. Second, the method should generate a statistical weight for each sequence examined, as MNase is known not to have a simple recognition sequence but rather recognizes a class of sequences. Third, a number of observations suggest that although the duplex DNA structure recognized by MNase leads to a single strand cleavage, it is followed by a rapid cleavage of the second strand offset 5' by several nucleotides. The method used should be minimally affected by this complication, as will be the case when short test sequences are used. Two simple statistical methods were chosen to best meet these requirements.
The first method, a maximum likelihood search, is simply to scan the single nucleotide cleavage data from the sequencing gel autoradiographs for all occurrences of a specified short test sequence of length m. Cleavage probabilities for each occurrence, Pjk(m, seq), are determined by adding the respective cleavage probabilities for each position within that sequence of length m. Recognition sequences are identified by calculating a mean cleavage probability for each test sequence, ⟨Pjk(m, seq)⟩, and by then ranking all the sequences tested (of length m) by that mean cleavage probability. Errors due to flanking sequence effects, sample size, etc. are estimated by calculation of the standard error of the mean.
The second method, sequence extension, has been used to examine further sequences of interest identified by the first method. Dinucleotide pairs frequently represented among preferentially cleaved sequences (identified from the maximum likelihood search described above) were selected, and a single nucleotide was added either 5' or 3' to the sequence. Additions that revealed a significant enhancement of the cleavage probability were retained, and another nucleotide was added. Statistical confidence levels are calculated in the same manner as the first method. While this causes the number of alternatives to increase as 4^n after the nth step, this did not present problems in the present analysis since MNase appears to have a fairly well-defined set of recognition sequences (see Results and Discussion). A detailed explanation of the mathematical formulae used in this analysis and the theoretical considerations involved are available from the first author (J. T. Flick).
† The following terms will be defined for use in the text. Single nucleotide cleavage probability, P1: the probability that MNase cleaves a phosphodiester bond 5' to a given nucleotide on a single strand of duplex DNA. Preferentially cleaved site: a MNase single-stranded cleavage site is defined by having at least one P1 greater than a specified minimum background level. The total probability (Psite) for a test sequence of length m is defined as the sum of all contiguous single nucleotide cleavage probabilities within m, each greater than the minimum. A site is predicted to be preferentially cleaved if Psite > Psite,th, where Psite,th is a selected threshold value. Test sequence cleavage probability, Pjk(m, seq): the sum of all observed MNase single nucleotide cleavage probabilities for the kth occurrence of a test sequence (seq) of length m, starting at base position j and ending at j + m - 1. The mean test sequence cleavage probability for all occurrences within the data set is Pm(seq) = ⟨Pjk(m, seq)⟩, where ⟨ ⟩ indicates calculation of the average. Recognition sequence: a particular DNA test sequence of length m having a mean cleavage probability Pm(seq) greater than a specified minimum threshold level, Pm,th. Preferentially cleaved region: the location of duplex DNA sequences preferentially cleaved by MNase, observed and mapped at the agarose gel level of resolution.
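The maximum likelihood search described in section (d) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program used in this study: the function name rank_test_sequences and the toy data are hypothetical, and the standard error is computed from the sample standard deviation (n - 1 denominator), a choice the text does not specify.

```python
from collections import defaultdict
from math import sqrt


def rank_test_sequences(sequence, p1, m):
    """Rank every test sequence of length m by its mean cleavage probability.

    sequence : DNA string for one strand, read 5' to 3'
    p1       : p1[i] is the relative cleavage probability 5' to sequence[i]
    Returns a list of (test_sequence, mean, sem, n_occurrences),
    sorted with the highest mean first.
    """
    if len(sequence) != len(p1):
        raise ValueError("sequence and p1 must have the same length")

    # Pjk(m, seq): sum of the single nucleotide probabilities inside
    # each occurrence of each test sequence.
    occurrences = defaultdict(list)
    for j in range(len(sequence) - m + 1):
        occurrences[sequence[j:j + m]].append(sum(p1[j:j + m]))

    ranked = []
    for test_sequence, values in occurrences.items():
        n = len(values)
        mean = sum(values) / n
        if n > 1:
            # Assumption: sample variance with an n - 1 denominator.
            variance = sum((v - mean) ** 2 for v in values) / (n - 1)
            sem = sqrt(variance / n)
        else:
            sem = None  # a single occurrence gives no error estimate
        ranked.append((test_sequence, mean, sem, n))

    ranked.sort(key=lambda row: row[1], reverse=True)
    return ranked


# Invented toy data, for illustration only (not values from this study).
toy_sequence = "ACTATGGCTATA"
toy_p1 = [0, 0, 5, 4, 1, 0, 0, 0, 3, 2, 1, 0]

ranked = rank_test_sequences(toy_sequence, toy_p1, m=4)
by_sequence = {row[0]: row[1:] for row in ranked}
print(by_sequence["CTAT"])  # (mean, standard error, occurrences)
```

A recognition sequence in the sense of the footnote would then be any entry whose mean exceeds a chosen threshold Pm,th; the threshold itself is left to the caller in this sketch.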
3. Results and Discussion
(a) Identification and qualitative analysis of sites preferentially cleaved by MNase
Identification of sequences preferentially cleaved by MNase digestion of protein-free duplex DNA was begun by locating several preferentially cleaved regions at the level of agarose gel resolution. This was achieved by comparison of the agarose gel and acrylamide gradient gel data with the sequencing gel data. A typical result is shown in Figure 1, revealing the relationship between three preferentially cleaved agarose gel regions and the clustered, preferentially cleaved sequence level sites. Interestingly, there is close agreement between the densitometry profiles of the sequencing gel and the non-denaturing acrylamide gradient gel. This suggests that cleavage of the second strand generally occurs within a few nucleotides of the first strand cleavage site. This is consistent with other work, which revealed that during the early stages of DNA digestion with MNase, relatively little change is observed in the cleavage pattern with time (Modak & Beard, 1980; Horz & Altenberger, 1981; Keene & Elgin, 1981).
Preferentially cleaved regions identified at the agarose gel level of resolution were found to reflect one or more (i.e. a cluster of) preferentially cleaved sites when resolved at the single nucleotide level. Figure 2(a) shows the MNase DNA cleavage patterns associated with six preferentially cleaved agarose gel regions identified in Figure 1(a). Analysis of the entire 2.3 kb of sequence level data revealed a total of 140 preferentially cleaved sites comparable in amplitude to the sites comprising the six regions, thus greatly increasing the amount of data available for analysis. Typical examples of these sites are shown in Figure 2(b). Two general features are apparent. First, a number of sites have the following "core" recognition sequences: 5' CTAT..., GTAT..., CATA..., GATA..., where the ellipses indicate an extending pattern of alternating AT†. Second, cleavage of the strand opposite the core sequence generally occurs with a 5' overhang of as many as five nucleotides, suggesting cleavage across the major groove. This can be observed even in the absence of a "core" recognition sequence on the opposite strand, as shown by analysis of the cleavage of both strands over the same 600 base-pair region. Thus, it appears that preferential MNase cleavage can be related to these sequences on either one or both strands. Many of the apparent observed exceptions to the core sequences may therefore reflect staggered cleavage on the strand opposite a core sequence.
Figure 1. Scanning densitometry of acrylamide gradient gel and sequencing gel autoradiographs of MNase-digested end-labeled DNA restriction fragments within the 44D larval cuticle locus of D. melanogaster. (a) Agarose gel data showing the positions of preferentially cleaved regions within the 1.9 kb region examined between gene I and gene II. (b) Acrylamide gradient gel data. (c) Sequencing gel data showing the preferentially cleaved sites. Bars under the sequence level data indicate the approximate extent of the regions seen in the agarose gel (a). Numbers indicate relative nucleotide positions, as numbered in the GenBank (T.M.). Acrylamide gradient gel data were positioned relative to the sequencing gel data by using TaqI-Sau3A restriction fragments from the same DNA.
† For the sake of clarity, the deoxy "d" prefix will be suppressed in this paper. No reference is made to ribonucleotides, so no confusion will arise.
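The "core" recognition sequences noted in section (a), a 5' G or C followed by a run of alternating AT, can be located on both strands with a short scan. The Python sketch below is illustrative only: the function names and the min_at parameter (minimum length of the alternating AT run, set to 3 to match the four-nucleotide cores CTAT, GTAT, CATA and GATA) are assumptions introduced here, and the sketch does not model cleavage probabilities or the staggered opposite-strand cut.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def alternating_at_run(seq, start):
    """Length of the strictly alternating A/T run beginning at seq[start]."""
    n = 0
    while (start + n < len(seq)
           and seq[start + n] in "AT"
           and (n == 0 or seq[start + n] != seq[start + n - 1])):
        n += 1
    return n


def find_core_sites(seq, min_at=3):
    """List (index of the 5' G or C, matched core sequence) on one strand."""
    sites = []
    for i, base in enumerate(seq):
        if base in "CG":
            run = alternating_at_run(seq, i + 1)
            if run >= min_at:
                sites.append((i, seq[i:i + 1 + run]))
    return sites


def find_core_sites_both_strands(seq, min_at=3):
    """Scan both strands; positions are reported in top-strand coordinates."""
    top = [("top", i, core) for i, core in find_core_sites(seq, min_at)]
    bottom = [("bottom", len(seq) - 1 - i, core)
              for i, core in find_core_sites(reverse_complement(seq), min_at)]
    return top + bottom


# Invented example sequence, not taken from the 44D locus.
sites = find_core_sites_both_strands("AAGCTATACC")
print(sites)
```

In this invented example the top strand carries a CTATA core and the bottom strand a GTATA core, so a core sequence can be present on either one or both strands of the same short region.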
(b) Statistical analysis of the experimentally determined cleavage probability data: sequences recognized by MNase
A statistical analysis of the MNase DNA cleavage data was necessary for a number of reasons. First, the qualitative analysis indicated that there were a significant number of DNA sequences cleaved that did not include the core sequences described above. Second, the development of a predictive model of DNA cleavage required a quantitative ranking of relative cleavage frequencies (cleavage probabilities). The testing of such a model should be useful in confirming the correctness of the identified recognition sequences. Third, the quantitative ranking of cleavage frequencies for a large number of DNA sequences should prove useful in identifying DNA structural features associated with MNase cleavage preferences. Two complementary statistical methods were chosen for this analysis, a maximum likelihood search and a sequence extension method; both are described in Materials and Methods, section (d).
The results of an analysis using a maximum likelihood search for the 256 four-nucleotide test sequences are shown in Table 1 and Figure 3(a). Note that the four core sequences identified by inspection rank among the first 11 as being preferentially cleaved. The graphical presentation of ranked results in Figure 3(a) displays the degree of deviation from sequence-neutral cleavage, which would have resulted in a horizontal line (i.e. equal cleavage probability for all test sequences). (A single recognition sequence would be represented by a vertical line at the origin.) A three-nucleotide sequence search generates a similar pattern, as shown in Figure 3(b). Searches with sequences longer than four nucleotides could not be performed with our data base, as many of the sample sizes are too small for adequate statistical analysis. Further, a maximum likelihood search for a core recognition sequence longer than four nucleotides would probably not resolve the relevant sequences as well, due to inclusion of the staggered opposite strand cleavage events.
Figure 2. (a) Sequence level MNase cleavage patterns for the preferentially cleaved agarose gel regions shown in Fig. 1. Relative 5' nucleotide cleavage probabilities (P1) were determined by scanning densitometry and integration of each single nucleotide peak. The amplitudes drawn are proportional to a score of 1 to 10 cleavage probability units. Site VI is omitted as the sequence level position was not established. Data were obtained for both DNA strands at site I but for only one strand at the other sites. Note the postulated core recognition sequences (5' to 3'): CTAT..., GTAT..., CATA..., GATA... The first 4 nucleotides of these sequences are underlined. (b) Typical MNase cleavage probability patterns from a continuous stretch of DNA near region I.
Table 1
Preliminary search for four-nucleotide sequences cut preferentially by MNase
Sequence    ⟨P4⟩          Number of sites scored
CTAT        9.0 ± 1.0     4
CTAG        9.0 ± 1.4     4
GTTA        8.9 ± 0.9     3
ATAT†       7.5 ± 0.7     27
TATA†       7.0 ± 0.?     28
GTAT        6.9 ± 0.9     20
GCTA‡       6.7 ± 2.0     4
CGTA‡       5.7 ± 0.8     10
CATA        5.6 ± 0.7     12
CTTA        4.8 ± 0.?     7
GATA        4.8 ± 0.7     10
Crude maximum likelihood search results for 4-nucleotide sequences. The Table includes the 11 preferentially cleaved test sequences having ⟨P4⟩ greater than P4,th, where the threshold P4,th has the value 4.8 as indicated in Fig. 3(a).
† Sequences later shown to be recognition sequences, but with a cleavage probability comparable to GATA (see the text and Fig. 4).
‡ Sequences observed to be cleavage subsets of other core sequences (see the text).
Interestingly, three and four-nucleotide searches show the same relative cleavage probability ranks for the core sequences:
P(CT...) > P(GT...) > P(CA...) > P(GA...);
where the ellipses indicate continuation of an alternating AT pattern. This suggests that the uncertainties in ranking order are less than implied by the standard errors of the mean, indicated by shading in Figure 3(a) and (b). This result cannot be due simply to sampling the same preferentially cleaved sites in the three and four-nucleotide searches, as the number of sites increases dramatically as the test sequence length is reduced.
Having established on a statistical basis that a sequence preference exists in a four-nucleotide test sequence search, one must define a cleavage probability threshold P4,th to identify those sequences responsible for the agarose gel pattern; this is shown in Figure 3(a). Self-consistency tests support the validity of this critical threshold choice†.
(c) Deletions and additions to the suggested core recognition sequences
The 11 four-nucleotide sequences having mean cleavage probability ⟨P4⟩ greater than the threshold cleavage probability (P4,th) defined in Figure 3(a) are shown in Table 1. This choice of threshold leads to inclusion of the four core sequences previously observed (CTAT, GTAT, CATA and GATA), as well as a number of other sequences. In generating a descriptive model, each of the other sequences can be either identified and described as a cleavage "subset" of one of the core sequences or must be included separately as a distinct MNase recognition sequence. Inspection of the original data shows that where GCTA is part of a preferential cleavage site, it overlaps the sequence CTAT, and therefore it does not make an independent contribution to the model and can be eliminated. Similarly, CGTA occurs as a subset of GTAT sites. The necessary additions to the tentatively identified core sequences, as well as many test sequences lying just below the threshold, can be divided into three main classes: (1) alternating AT; (2) sequences similar to core sequences, but with single insertions of an A or T in the alternating AT string (i.e. "breaks"), giving GTTA, CTTA, etc.; and (3) an AT or TA sequence flanked by a 5' and 3' C or G: CTAG, GTAC, etc.
† The choice for P4,th in Fig. 3 and Table 1 can be checked by applying standard tests for sensitivity (S) and positive predictive value (PPV). S is defined as the percentage of preferentially cleaved sites (Psite > Psite,th) that are characterized by the presence of a 4-base recognition sequence. A high value of S (i.e. 100%) would indicate that the set of chosen recognition sequences was large enough to identify all preferentially cleaved sites. The PPV is the percentage of all occurrences of a DNA recognition sequence that is actually associated with an observed preferentially cleaved site. A high PPV would indicate that the presence of a recognition sequence correctly predicts a preferential MNase DNA cleavage site. One may vary Psite,th so that different classes of preferentially cleaved sites defined by Psite > Psite,th can be examined. Note that P4,th cannot be compared directly to Psite,th as the preferentially cleaved site length is not restricted to 4 nucleotides in the latter case. As Psite,th is raised, S rises to an observed asymptotic maximum characteristic of the chosen set of recognition sequences. For the set of 4 core recognition sequences, the maximum value of S for the data set examined here is only 70%, even for stringent conditions, where only the most prominent 14 of the visually evident 140 preferentially cleaved sites are being scored. If the complete set of recognition sequences described in this paper are included, S > 95% in scoring the complete set of 140 sites. Therefore, the choice of P4,th in Fig. 3(a) and Table 1 used to generate the set of recognition sequences is not too high. In contrast to the consequences for the sensitivity test, raising Psite,th reduces the PPV. However, reducing P4,th reduces the PPV by increasing the number of recognition sequences included in the model. All but one of the 4-nucleotide core recognition sequences has a PPV > [illegible]%, considering all 140 preferentially cleaved sites. The PPV is, however, 100% for all 5-nucleotide core sequences except GATAT, even when only the most prominent 14 sites are considered. Results for 5-nucleotide core sequences with single alternating AT string breaks have somewhat lower PPVs, as expected. Thus, while one cannot firmly establish the value of P4,th as being as high as possible, it is clear that P4,th should not be reduced.
Figure 3. Mean cleavage probability data for (a) 4-nucleotide sequences (256 possibilities) and (b) 3-nucleotide sequences (64 possibilities). Mean values were derived from data like those displayed in Fig. 2 and ranked to indicate the dependence of cleavage probability on sequence. The horizontal axis is sequence rank. For the sake of clarity, probabilities are plotted continuously, with standard errors of the mean indicated by shading. A selected threshold (P4,th) is marked for the 4-nucleotide search case with an arrow. Tests on the validity of this choice are discussed in the footnote to p. 624.
(i) Alternating AT
The vast majority of the occurrences of the sequences ATAT, TATA are within core recognition sequences on one or both DNA strands. Examination of core sequences having long uninterrupted stretches of alternating AT shows consistently enhanced cleavage near the 5' G or C end, even for an AT stretch of as many as eight nucleotides (data not shown). This indicates that while the presence of alternating AT is a significant feature, it alone is not sufficient to explain the observed MNase cleavage profiles. Application of the sequence extension method of analysis to selected alternating AT sequences lacking a neighboring 5' G or C on either strand (see Fig. 4) reveals their rank as roughly comparable to the least preferentially cleaved core sequence GAT, GATA, GATAT, etc., of the same length. Thus, although isolated ATAT, TATA should be included as recognition sequences defined separately from the core sequences, these sequences are cleaved with a lower preference than CTAT, GTAT or CATA.
(ii) Core sequences with breaks in the alternating AT pattern
Table 1 reveals that MNase preferentially cleaves GTTA. If sequences with single "breaks" in the alternating AT sequence following CT, GT, CA and GA are examined, a pattern develops for preferential cleavage, slightly reduced from the four core recognition sequences (see Fig. 4). The effect of such a break appears greatest near the 5' G or C, with additional breaks reducing cleavage probabilities further (results not shown). Sequences with a single break should therefore be considered for inclusion as recognition sequences.
(iii) AT or TA flanked by a 5' and 3' G or C
Examination of CTAG, and other similar less prominently cleaved sequences (CTAC, GATC, etc.) reveals that the relative cleavage preferences are consistent with the known three nucleotide maximum likelihood search ranking P3(CTA...) > P3(GTA...) > P3(CAT...) > P3(GAT...), if contributions from both strands are included. The preferentially cleaved sequence CTAG shown in Table 1 appears to be a special case where the three nucleotide core recognition sequence CTA occurs on both strands. Other four-nucleotide sequences of this type had cleavage probabilities much less than P4,th, and were therefore not included.
Figure 4 reveals another feature of the MNase recognition sequences: the longer the stretch of alternating AT, the greater the probability that a sequence will be preferentially cleaved. This observation explains why core recognition sequences of three nucleotides do not reliably predict the MNase cleavage pattern, despite their success in reflecting mean cleavage probabilities. However, such sequences longer than four nucleotides do reliably predict MNase cleavage.
(d) Comparison of the proposed micrococcal nuclease recognition sequences with prior work
These observations on the sequence specificity of MNase cleavage of protein-free DNA are in general agreement with and extend the work of earlier analyses. Horz & Altenberger (1981) examined approximately 20 preferentially cleaved sites in a satellite DNA sequence, and proposed the duplexes 5' CATA 3'/3' GTAT 5' and 5' CTA 3'/3' GAT 5' as prominent recognition sequences. They observed one preferentially cleaved core recognition sequence containing five nucleotides of alternating AT. Dingwall et al. (1981) noted that among five preferentially cleaved sites in a 5 S RNA gene repeat, CTAG was unusually prominent. Razvi et al. (1983) demonstrated preferential cleavage of six sites containing sequences of alternating AT of two or more nucleotides flanked by a 3' or 5' G or C. They also observed the 5' overhang of opposite strand cleavage. Other investigators have observed this staggered cutting in MNase chromatin digestions (Cockell et al., 1983).
Figure 4. The effect of alternating AT sequence length, or breaks in the alternating AT pattern, on the cleavage probabilities of core recognition sequences. This represents an application of the sequence extension method. These cleavage probabilities were used in the predictive model of MNase cleavage. The vertical axis is relative cleavage probability, and the horizontal axis sequence length, as the alternating AT string is extended. Sequences included in the analysis were carefully chosen to avoid the flanking effects of a 3' C or G or alternating AT string longer than the length specified. The lines indicate trends between different sequence lengths, with error bars showing the standard error of the mean. (○) Values based on a single data point; no error bars can be given. (a) Isolated recognition sequences with perfect alternating AT strings. Note the general agreement with the ranking assignments of Table 1, where corrections were not made for flanking sequence effects. (b) Isolated recognition sequences with single breaks, insertions of an A or T into the alternating AT pattern, i.e. CTTA, CTAA instead of CTAT, etc. (c) Isolated alternating AT.
(e) A predictive model for MNase cleavage patterns
A test of the validity of the proposed core recognition sequences was made by developing a computer program based on a simple empirical model of preferential cleavage of DNA by MNase. Briefly, the method was to scan a duplex DNA sequence for the proposed core recognition sequences four to seven nucleotides in length on at least one strand, where the cleavage probability for each of these sequences (CTAT..., GTAT..., CATA..., GATA...; the ellipses indicating extension of the alternating AT) was taken from Figure 4(a). Other recognition sequences were included in the model in the following manner. The sequence CTAG was assigned a mean cleavage probability of 9.0 (see Table 1). Sequences containing "breaks" in the alternating AT sequence were not scored directly but were included in a simplified manner by assigning a cleavage probability for each separate three-nucleotide long alternating AT occurrence, using the values shown in Figure 4(c). Agarose gel resolution was approximated by averaging the resultant cleavage probability profile with a normal (Gaussian) distribution of 30 nucleotides half-width. The success of the model depends in part on inclusion of an adequate set of core sequences, which in turn depends on the selection of P4,th illustrated in Figure 3(a); the choice made here is supported by self-consistency tests.

Figure 5. Predictions of the empirical predictive MNase cleavage pattern model compared to results on agarose gels for the D. melanogaster larval cuticle gene region. (a) Known prominent MNase cleavage sites lying between genes I and III (brackets) are indicated with filled arrowheads. (b) Predictions of the model for the cuticle gene DNA sequences. For clarity, only cleavage probabilities greater than 0.7 arbitrary unit are shown. (c) A profile of ATAT-TATA occurrence frequencies shows significant agreement with the cleavage profile. (d) A profile of AT base composition is included for comparison.

A comparison of the predictive model to the MNase cleavage pattern on agarose gels for the D. melanogaster cuticle locus is shown in Figure 5. A simple tabulation of AT content does predict the location of the genes, as anticipated (compare Fig. 5(a) and Fig. 5(d)). A much better fit to the MNase cleavage pattern is obtained by plotting the occurrence of alternating AT (see Fig. 5(c)). This map shows a correlation with both the position and intensity of the MNase cleavage sites (see in particular the region between gene I and the pseudogene), emphasizing that not only base composition but also the arrangement of the nucleotides is important (see also Eissenberg & Elgin, 1983). The results obtained with the more detailed model based on the sequence level observations of MNase cleavage are shown in Figure 5(b). The model appears effective in predicting positions of preferentially cleaved DNA regions (generally marked by clusters of preferentially cleaved DNA sequences), but is less successful in predicting relative cleavage probabilities. Examination of specific sequences predicted to be preferentially cleaved suggests at
least two reasons for this. First, comparison of the predicted cleavage probability data with experimental observations reveals that recognition sequences with "breaks" in the alternating AT pattern are given too little weight (e.g. compare cleavage probabilities in Fig. 4(a) with Fig. 4(b)). Second, only mean cleavage probabilities for each recognition sequence can be assigned, ignoring specific flanking DNA effects. Despite these limitations, a comparison of model predictions to agarose gel data for the human beta globin locus (Keene & Elgin, 1984) reveals a remarkably good agreement (Fig. 6).

Figure 6. A comparison between (a) agarose gel data and (b) predicted MNase DNA cleavage profiles for the human β-globin gene (Keene & Elgin, 1984). Features are indicated as for Fig. 5.

The predictive model of MNase DNA cleavage can also be tested by examining known DNA sequences for the features of the MNase cleavage patterns experimentally observed by Keene & Elgin (1984), namely the relative cleavage preference of MNase for regions flanking protein coding sequences in eukaryotes, and the pattern of cleavage at roughly 100 to 300 base-pair intervals within these spacer regions. A number of eukaryotic gene loci where long regions of flanking DNA had been sequenced were therefore selected from the GenBank library (T.M.: Bolt, Beranek & Newman) for analysis. The sequences used were from: (1) D. melanogaster: the larval cuticle protein locus (8.7 kb), the HDL locus (9.7 kb), and the alcohol dehydrogenase locus (2.1 kb); (2) human: the α interferon locus (9.7 kb), the β interferon locus (1.8 kb), the γ interferon locus (6 kb), the growth hormone locus (2 kb), and the β globin locus (4 kb); and (3) mouse: the glandular kallikrein locus (9 kb). Analysis of these sequences using the predictive model revealed that preferentially cleaved regions occurred in the spacer DNA and in several long introns, but rarely in gene coding regions. Predicted distances between preferentially cleaved regions in the non-coding DNA were typically 100 to 300 bases, independent of the choice of a cutoff over a range of values above and below the selected threshold (0.7 arbitrary unit) utilized in Figures 5 and 6 (results not shown). The model predictions of preferred MNase DNA cleavage therefore agree in general with earlier experimental observations (Keene & Elgin, 1984; Udvardy & Schedl, 1983).

The gene/spacer pattern predicted by the model can also be analyzed by plotting the mean frequency of predicted preferential cleavage regions (or sites at the agarose gel level of resolution), as a function of cutoff threshold choice from 0.6 to 0.9 arbitrary unit. This has been done for the eukaryotic DNA sequences listed in the above paragraph and the prokaryotic sequences listed in the Figure legend (Fig. 7(a) and (b)). Note the low frequency of preferentially cleaved sites predicted within eukaryotic gene coding regions. Figure 7(a) also reveals that the frequency of such sites in eukaryotic coding regions is significantly reduced from the frequency for randomized sequences with the same length and single-strand nucleotide composition (P value <0.05), suggesting the existence of a constraint reducing the occurrence of MNase recognition sequences. While the frequency of MNase recognition sequences in eukaryotic non-coding DNA is also reduced relative to random expectations, it is by a lesser amount (P value
Figure 7. Comparison of MNase cleavage model predictions for coding and non-coding DNA. The number of MNase cleavage sites predicted per kb is plotted against a threshold level from 0.6 to 0.9 arbitrary unit. (a) Eukaryotic DNA sequences (see the text for listing). (b) Prokaryotic/phage DNA coding regions (E. coli lac locus, fd phage, φX174). Results with phage λ and pBR322 (not shown) are similar. (●) Predictions derived from the original DNA sequences. (○) Predictions derived from randomized DNA sequences having the same length and single-strand nucleotide composition. Error bars indicate plus and minus 1 standard deviation.
[...] recognition sequences (Fig. 7(b)). Analysis (not shown) of bovine papilloma virus type I (8.9 kb) and simian virus 40 (SV40; 5.7 kb) coding regions also showed no such constraint. A comparative study of non-coding DNA regions could not be made with these sequences, as there is little non-coding sequence in these genomes. Thus, if our small data base has not misled us, a constraint reducing the probability of MNase cleavage within coding regions is present and limited to eukaryotic chromosomal genes.

We performed a direct confirmatory test of the above observation by scanning the eukaryotic coding sequences for the occurrence frequency of alternating AT strings. This is acceptable as a first approximation to the MNase recognition sequences, as the majority of such strings can be shown statistically to be flanked by a 5' C or G on at least one strand. Statistical analysis revealed that the occurrence of alternating AT strings of two to six bases in length is reduced by a factor of 3 to 5 from random expectations within the eukaryotic gene coding regions, with a net P value of much less than 0.05 (results not shown). In contrast, analysis of alternating AT strings in the eukaryotic non-coding regions or in viral and bacterial coding regions again revealed little deviation from random expectations.
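The confirmatory test described above (counting alternating AT strings and comparing with randomized sequences of the same length and single-strand composition) can be illustrated with a minimal Python sketch. The function names, the number of shuffles and the treatment of runs longer than six bases (ignored here) are assumptions made for illustration; the original program is not described at this level of detail.

```python
import random


def count_alternating_at_runs(seq, min_len=2, max_len=6):
    """Count maximal runs of strictly alternating A/T (ATAT..., TATA...).

    Runs shorter than min_len or longer than max_len are ignored.
    """
    counts = {n: 0 for n in range(min_len, max_len + 1)}
    seq = seq.upper()
    i = 0
    while i < len(seq):
        if seq[i] not in "AT":
            i += 1
            continue
        j = i + 1
        # Extend the run while bases stay A/T and differ from the previous base.
        while j < len(seq) and seq[j] in "AT" and seq[j] != seq[j - 1]:
            j += 1
        run = j - i
        if run in counts:
            counts[run] += 1
        i = j
    return counts


def shuffled_expectation(seq, trials=200, seed=0, min_len=2, max_len=6):
    """Mean run counts over shuffled copies of seq.

    Shuffling keeps the length and single-strand nucleotide composition.
    """
    rng = random.Random(seed)
    letters = list(seq.upper())
    totals = {n: 0 for n in range(min_len, max_len + 1)}
    for _ in range(trials):
        rng.shuffle(letters)
        counts = count_alternating_at_runs("".join(letters), min_len, max_len)
        for n, c in counts.items():
            totals[n] += c
    return {n: totals[n] / trials for n in totals}
```

Dividing the shuffled expectation by the observed count for each run length gives a reduction factor of the kind quoted in the text; a significance estimate would additionally need the spread of the shuffled counts, which this sketch does not compute.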
One would like to know whether or not the constraint observed here in eukaryotic coding sequences can be related to observed constraints on codon usage. A calculation of the predicted frequencies of MNase cleavage sites within genes based on observed codon usage (Grantham et al., 1981) gave results that were not sufficiently different from random expectations to explain the experimental observations. It has been reported that in some instances a pattern favoring purine-N-pyrimidine codons is observed. However, this pattern is reported to be prominent in certain viral and prokaryotic coding regions, and less so in eukaryotic coding regions (Shepherd, 1981a,b), in contrast to what is seen here. A recent statistical analysis of codon usage in eukaryotes has revealed that the third base choice of a codon may be limited by additional constraints determined by the following codon (Lipman & Wilbur, 1983). Although the results are general and do not give information on specific DNA sequences (an inherent weakness of conventional information theory (Shannon, 1948a,b; Gatlin, 1972)), this constraint on eukaryotic gene coding regions may be related to the relative lack of MNase recognition sequences. Lipman & Wilbur (1983) note that this codon usage constraint is much less prominent in prokaryotes, consistent with the observations reported here.
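The scan-and-smooth predictive model outlined at the start of this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' program: the per-length weights are hypothetical placeholders (the paper takes its values from Figure 4(a)), the 30-nucleotide "half-width" is interpreted here as a half-width at half-maximum, CTAG is scored once on the top strand only, and the separate scoring of "broken" alternating AT strings from Figure 4(c) is omitted.

```python
import math

# Hypothetical placeholder weights keyed by site length (5' C/G plus the
# alternating AT run, 4-7 nt). The paper reads its values from Figure 4(a).
PLACEHOLDER_WEIGHT = {4: 1.0, 5: 2.0, 6: 3.0, 7: 4.0}
CTAG_WEIGHT = 9.0  # mean cleavage probability the text assigns to CTAG
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def strand_sites(seq, score_ctag):
    """Yield (index of the 5' C or G, weight) for sites on one strand."""
    n = len(seq)
    for i in range(n - 3):
        if seq[i] not in "CG":
            continue
        if score_ctag and seq[i:i + 4] == "CTAG":
            yield i, CTAG_WEIGHT
            continue
        j = i + 1
        # Extend while bases stay A/T and strictly alternate.
        while j < n and seq[j] in "AT" and (j == i + 1 or seq[j] != seq[j - 1]):
            j += 1
        length = min(j - i, 7)  # cap at the longest scored site
        if length >= 4:
            yield i, PLACEHOLDER_WEIGHT[length]


def raw_profile(seq):
    """Sum site weights from both strands in top-strand coordinates."""
    seq = seq.upper()
    n = len(seq)
    profile = [0.0] * n
    for pos, weight in strand_sites(seq, score_ctag=True):
        profile[pos] += weight
    bottom = seq.translate(COMPLEMENT)[::-1]  # reverse complement, 5'->3'
    for pos, weight in strand_sites(bottom, score_ctag=False):
        profile[n - 1 - pos] += weight
    return profile


def smooth(profile, half_width=30.0):
    """Average the profile with a normalized Gaussian kernel."""
    sigma = half_width / math.sqrt(2.0 * math.log(2.0))  # HWHM assumption
    radius = int(3 * sigma)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(kernel)
    kernel = [k / total for k in kernel]
    n = len(profile)
    out = [0.0] * n
    for i, value in enumerate(profile):
        if value == 0.0:
            continue
        for k, w in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < n:
                out[j] += value * w
    return out


def peaks(profile, threshold):
    """Indices of local maxima at or above threshold (arbitrary units)."""
    return [i for i in range(1, len(profile) - 1)
            if profile[i] >= threshold
            and profile[i] > profile[i - 1] and profile[i] >= profile[i + 1]]
```

Calling `peaks(smooth(raw_profile(sequence)), threshold)` returns candidate positions of preferentially cleaved regions; because the weights are placeholders, the threshold is not on the same scale as the 0.6 to 0.9 arbitrary units used in the text.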
Figure 8. Comparison of the (OP)2Cu+ and MNase cleavage patterns at the sequence level. Typical results are shown for region I, the region indicated by the filled bar. Note that while MNase and (OP)2Cu+ preferentially cleave the same region, individual nucleotide cleavage probabilities are different. Sequences shown in the figure: AAATATATATACT and CACCTGTTACATAATTTCTAGTTA.
(f) MNase recognition sequences: DNA structural correlations

(i) The similarity of MNase and (OP)2Cu+ DNA cleavage patterns
The similarity of the orthophenanthroline-cuprous complex ((OP)2Cu+) and MNase DNA cleavage patterns at the level of agarose gel resolution suggests that these agents recognize a common DNA structural feature. Recent analyses of (OP)2Cu+ cleavage patterns at the sequence level revealed that the sequence most prominently cleaved was a long double core sequence, 5' CATATC 3' (Drew & Travers, 1984). In another study, the three sequences most preferentially cleaved by (OP)2Cu+ contained ...TATC... (Jessee et al., 1982). We have obtained a similar result; the three clusters of sequences most preferentially cleaved by MNase in a 400 base-pair fragment (regions I, II and III) are the three sequences most preferentially cleaved by (OP)2Cu+. The result for region I is shown in Figure 8. It is important to note that MNase and (OP)2Cu+ do not cleave between individual nucleotides with the same frequency (probability) (see also Cartwright & Elgin, 1982), and have fundamentally different mechanisms of DNA cleavage.

(ii) Correlations of MNase recognition sequences with DNA structural determinants

The similarities of MNase and (OP)2Cu+ cleavage patterns suggest that these reagents recognize a DNA structural feature of a general nature. The diversity of the MNase recognition sequences also points to the general nature of the structural features recognized. MNase cleavage of isolated alternating AT (no 5' G or C on either strand) shows a broad pattern of cleavage sites, extending generally over the entire alternating AT string. If a 5' G or C is present, the total cleavage is enhanced, and the peak shifted to within a few nucleotides of the 5' G or C (see examples in Fig. 2). However, as indicated in Figure 4, the total cleavage probability is directly dependent on the length of alternating AT. This suggests that two DNA structural features may contribute to preferential DNA cleavage by MNase, one contributed by strings of alternating AT, and the other a boundary feature determined or marked by a 5' G or C.

An interesting comparison between MNase cleavage data of the type listed in Table 1 and the available DNA melting data (helix stability) for a number of repetitive sequences is shown in Table 2. Section (1) includes data which demonstrate that the four-nucleotide sequences with the highest alternating AT content are preferentially cleaved and have the lowest Tm value. The data in section (2) demonstrate a possible correlation with a second feature of the MNase recognition sequences: Tm(CTA...) is less than Tm(CAT...); Tm(GTA...) is less than Tm(GAT...), suggesting a reason for the cleavage preference of a T over an A 3' to a G or C. It should be noted that the MNase digests were performed on non-repetitive DNA sequences well below Tm, and under salt conditions different from those of the melting experiments. (Salt conditions are known in some cases to alter high-resolution melting patterns (Gotoh, 1983).) However, the same correlation is evident using Tm data obtained under different salt conditions. MNase digestions of four repetitive DNA sequences (Dingwall et al., 1981) for which Tm data are available also show a consistent pattern (section (3) of Table 2).

This correlation between reduced DNA melting temperatures and preferential MNase cleavage for a limited number of sequences can be extended to any sequence by using dinucleotide duplex DNA binding energies (enthalpy or ΔH) or Tm values calculated by the statistical mechanical method of Gotoh (1983) (see also Saenger, 1984). This method has been used to accurately predict high-resolution melting profiles for a number of DNA fragments up to 4 kb in length (Gotoh, 1983). Figure 9 shows a typical predicted Tm profile compared to MNase cleavage data for a segment of the D. melanogaster cuticle protein locus DNA near region I. Sites of preferential MNase DNA cleavage are marked by local reductions in the dinucleotide Tm values. The reason for this correlation is that AT and TA
Table 2
Correlations between MNase cleavage probabilities and Tm data

      Sequence   Mean relative          Repetitive         Tm (°C) in    Tm (°C) in
      cleaved    cleavage probability   sequence melted    10 mM-Na+     50 mM-Na+

(1)   ATAT       7.5 ± 0.7              ATAT               39            54
      TATA       5.0 ± 0.3
      XBTA       4.1 ± 0.4              AATAAT                           57
      ATAA       2.5 ± 0.4
      TAAT       3.2 ± 0.6
      TATT       4.0 ± 0.6
      AAAA       0.6 ± 0.2              AAAA               48            61
      TTTT       1.1 ± 0.3

(2)   CTA        3.9 ± 0.3              CTACTA             63            73
      CAT        2.4 ± 0.5              CATCAT             68            76
      GTA        2.7 ± 0.4              GTAGTA             63            73
      GAT        1.4 ± 0.3              GATGAT             68            76

(3)   ATAT       l%+E                   ATAT               39            54
      AAAA       1.45                   AAAA               48            61
      GGGG       0.09                   GGGG               82            100
      GCGC       10.04                  GCGC               95            110

Correlation between published DNA melting data and MNase cleavage probabilities. For convenience only one DNA strand is displayed, 5'→3', and 4-nucleotide sequences are grouped to demonstrate apparent sequence-dependent features of MNase cleavage. (1) The effect of alternating AT content. (2) Preferential MNase cleavage with a T, rather than an A, following a 5' C or G. (3) Comparisons of repetitive DNA sequences cleaved by MNase (from Dingwall et al., 1981) and their respective Tm values. These latter cleavage probabilities are not directly comparable to those reported in this paper, used in parts (1) and (2) of this Table. The cleavage probability errors given are the standard errors of the mean. The Tm data are from Grant et al. (1972), Wells et al. (1970) and Wells & Blair (1967). Tm errors are typically ±1 deg. C.
Figure 9. Dinucleotide Tm (equivalent to DNA strand binding energy) predictions from the Gotoh (1983) model are compared to MNase cleavage data taken from analysis of the D. melanogaster 44D cuticle protein locus. The DNA segment shown was near region I.
dinucleotides have unusually low Tm values relative to other dinucleotide sequences, even AA and TT.
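The idea of locating local reductions in dinucleotide stability can be illustrated with a small Python sketch. The scores below are hypothetical placeholders in arbitrary units, chosen only to respect the ordering stated in the text (AT and TA lowest, then AA and TT); they are not the Gotoh (1983) parameters, and the window length is likewise an assumption.

```python
def step_score(dinucleotide):
    """Hypothetical relative stability of one dinucleotide step (arbitrary units)."""
    if dinucleotide in ("AT", "TA"):
        return 0.0  # lowest stability, as stated in the text
    if set(dinucleotide) <= {"A", "T"}:
        return 1.0  # AA, TT: placeholder value
    if set(dinucleotide) <= {"G", "C"}:
        return 3.0  # GG, CC, GC, CG: placeholder value
    return 2.0      # mixed steps: placeholder value


def stability_profile(seq, window=4):
    """Sliding-window mean of step scores along the sequence."""
    seq = seq.upper()
    steps = [step_score(seq[i:i + 2]) for i in range(len(seq) - 1)]
    return [sum(steps[i:i + window]) / window
            for i in range(len(steps) - window + 1)]


def least_stable_window(seq, window=4):
    """Start index of the first window with the lowest mean score."""
    profile = stability_profile(seq, window)
    return min(range(len(profile)), key=profile.__getitem__)
```

With real nearest-neighbor parameters substituted for the placeholders, minima of such a profile could be compared with observed cleavage sites in the manner of Figure 9.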
The Gotoh (1983) predictions (and the experimentally derived Tm data) do not reveal much correlation with the cleavage enhancing and focusing effect of a 5' C or G. Dinucleotide Tm predictions derived from the Ornstein & Fresco (1983) empirical potential model of DNA melting correlate less well with data on MNase cleavage.

Some further insight into the nature of MNase DNA cleavage preference can be derived from a two-dimensional nuclear magnetic resonance experiment done on a MNase recognition sequence duplex (Patel et al., 1982). Two different dodecamers were examined under the same experimental conditions: (1) GCGCTTAAGCGC, and (2) GCGCATATGCGC. Evidence for localized melting (breathing) was obtained by determining the relative exchange rates of individual imino HI protons with water protons. Analysis of these two sequences at 4°C, well below the Tm value, revealed that breathing was much more prominent in the center of molecule (2), which contains a core recognition sequence on both strands, than in molecule (1).

MNase recognition sequences may be regions of high DNA flexibility. Recent studies of DNA in solution have revealed that some sequences are considerably more flexible than others (Hogan et al., 1983; Chen et al., 1985). Chen et al. (1985) have shown from transient electric dichroism studies that poly(A,T)·poly(A,T) is unusually flexible, with a persistence length of roughly one half that of natural "random" sequence DNA. Since poly(A,T)·poly(A,T) also has an unusually low helix stability, it is tempting to speculate that there may be a correspondence between reduced helix stability and increased DNA flexibility. (For a theoretical discussion, see Manning (1983).) However, it is important to note that low helix
stability may be only one of several criteria for predicting increased DNA flexibility.

The observation that sequences preferentially cleaved by MNase may also be regions of low helix stability raises the possibility that MNase recognizes certain structural features of single-stranded DNA. While it is very likely that the initial kinetic phase of MNase-DNA binding involves interaction with duplex DNA, it is possible that in a later recognition stage the enzyme binds to a single strand structure before cleavage occurs. It is not clear which of these stages is sequence specific. MNase cleavage of single-stranded DNA has been studied in considerable detail by enzymatic and X-ray crystallographic analysis (Tucker et al., 1978, 1979; Dieters et al., 1982). Interestingly, the activation energy is thought to be identical for both single-stranded and double-stranded cleavage, indicating that the rate-limiting step (probably the last, hydrolysis) is the same (Tucker et al., 1979). Further, X-ray crystallographic analysis of this (17,000 molecular weight) protein bound to a molecule of pdTp led to the suggestion that the enzyme has a small hydrophobic pocket binding only to a single strand of DNA. Similar analysis of the binding of MNase to duplex DNA, or to longer sequences, has not yet been performed. Finally, analysis of MNase single-stranded DNA cleavage products has revealed that MNase preferentially cuts 5' to an A or T and not a G or C, in agreement with duplex DNA cleavage observations. An enhancing effect by a 5' C or G (or any 5' base) was not reported in the single-strand MNase cleavage studies (Tucker et al., 1978, 1979).

A recent study of DNA digestion by MNase has attempted to address directly the question of whether single-stranded structures play a role in MNase DNA cleavage (Drew, 1984). The same 20 base DNA sequence was cleaved by MNase in duplex form, and in a self-annealed single strand form with a stem-loop structure. In two out of three cases the same DNA sequence was preferentially cleaved in both the double and single strand form, suggesting that MNase must recognize at least some features of single-stranded DNA before the final cleavage occurs. In one case, a site was preferred in the stem-loop configuration. Unfortunately, the data base is small, and does not include the MNase recognition sequences defined in this paper.

Duplex DNA structural features are likely to be important in defining the sites preferentially cleaved by MNase. For example, the 5' C or G enhancing and focusing effect is apparently observed only in MNase cleavage of duplex DNA. If the final cleavage mechanism depended only on the presence of sequences that are easily melted, then regions of alternating AT should be less prominently cleaved in the presence of a flanking 5' G or C. A recent X-ray crystallographic study of another MNase recognition sequence, d(GGTATACC) (Shakked & Rabinovich, 1983), revealed some duplex DNA structural features that might be recognized by MNase. The analysis suggested that this sequence may be associated with a corresponding opening in the major groove, suggesting a correlation with the experimentally observed 5' C, G boundary effect. This result was attributed to stacking interactions between the G and T keto groups. Further, the structure of poly(A,T)·poly(A,T) DNA in solution is currently unresolved (e.g. see Assa-Munt & Kearns, 1984; Gupta et al., 1983), and may differ significantly from classical B-DNA. A general explanation of the correlation between preferential MNase cleavage and reduced local helix stability may thus lie in the ability of these regions to form an alternative DNA structure.
Although the specific features of this structure cannot yet be defined, it would seem likely that bending or twisting of the B-DNA axis, opening of the major or minor grooves or local distortions of single base positions may be necessary for optimal MNase DNA binding and cleavage. New DNA structure models need to be developed to explain the nature of these observations.

(g) MNase recognition sequences: summary and biological implications
MNase core recognition sequences, determined by a detailed statistical analysis of 2.3 kb of DNA cleavage data, generally consist of alternating AT of at least two bases with a 5' G or C, on at least one duplex DNA strand. Clusters of these sequences appear to explain the preferential MNase cleavage sites previously observed in eukaryotic non-coding DNA. Interestingly, these sites occur less frequently in eukaryotic gene coding regions than expected for a randomized single strand sequence having the same nucleotide content, even when codon usage frequencies are taken into account. It thus appears that the observed marked reduction in MNase cleavage of eukaryotic protein coding regions (Keene & Elgin, 1981, 1984) is not only due to increased G+C content, but is also due to a constraint reducing the frequency of these sequences. It is a challenge to speculate about the significance of this constraint, which does not appear to be present in the prokaryotic, the bacteriophage, or eukaryotic viral gene coding sequences studied in this paper. Likely possibilities include specific translational or chromatin structural requirements.

The DNA structures preferred by MNase are associated with two general features: (1) a region of low helix stability determined by a string of alternating AT; and (2) a cleavage-enhancing boundary feature associated with a 5' C or G. It is possible that the region of low helix stability may be unusually flexible, permitting bends in the DNA helix that are necessary for optimal attachment of certain DNA-binding proteins. The actual protein-DNA contact points need not be at the sites of such helix axis bends, and may lie elsewhere. The DNA sequence patterns observed in this work may thus be involved with the organization of DNA into a compact chromatin structure.

We are grateful to Dr Graham H. Thomas for several helpful discussions and for developing one of the computer programs used in the analysis. We thank Dr Peter Dunham for statistical advice, and Drs Iain Cartwright, Tom Tullius and Paul Sharp for critical reviews of the manuscript. This work was supported by NIH grant GM-30273 to S.C.R.E. One of us (J.C.E.) is supported by an NIH postdoctoral fellowship (2'3%GMOWX4). J.T.F. is supported by an NIH postdoctoral fellowship (F32-GM10205).
References
Assa-Munt, N. & Kearns, D. R. (1984). Biochemistry, 23, 792-796.
Barnes, W. M. (1979). Gene, 5, 127-139.
Calladine, C. R. (1982). J. Mol. Biol. 161, 343-352.
Cartwright, I. L. & Elgin, S. C. R. (1982). Nucl. Acids Res. 10, 5835-5852.
Chen, H. H., Rau, D. C. & Charney, E. (1985). J. Biomol. Struct. Dynam. 2, 709-719.
Cockell, M., Rhodes, D. & Klug, A. (1983). J. Mol. Biol. 170, 423-446.
Dickerson, R. E. & Drew, H. R. (1981). J. Mol. Biol. 149, 761-786.
Dieters, J. A., Galluci, J. C. & Holmes, R. R. (1982). J. Amer. Chem. Soc. 104, 5457-5465.
Dingwall, C., Lomonossoff, G. P. & Laskey, R. (1981). Nucl. Acids Res. 9, 2659-2673.
Drew, H. R. (1984). J. Mol. Biol. 176, 535-557.
Drew, H. R. & Travers, A. A. (1984). Cell, 37, 491-502.
Eissenberg, J. C. & Elgin, S. C. R. (1983). Mol. Cell. Biol. 3, 1724-1729.
Gatlin, L. (1972). Information Theory and the Living System, Columbia University Press, New York.
Gotoh, O. (1983). Advan. Biophys. 16, 4-52.
Grant, P. C., Kodama, M. & Wells, R. D. (1972). Biochemistry, 11, 805-815.
Grantham, R., Gautier, C., Gouy, M., Jacobzone, M. & Mercier, R. (1981). Nucl. Acids Res. 9, r43-r79.
Gupta, G., Sarma, M. H., Dhingra, M. M., Sarma, R. H., Rajagopalan, M. & Sasisekharan, V. (1983). J. Biomol. Struct. Dynam. 1, 395-416.
Hogan, M., LeGrange, J. & Austin, B. (1983). Nature (London), 304, 152-154.
Horz, W. & Altenberger, W. (1981). Nucl. Acids Res. 9, 2643-2658.
Jeppeson, P. G. N. (1974). Anal. Biochem. 58, 195-207.
Jessee, B., Gargiulo, G. & Worcel, A. (1982). Nucl. Acids Res. 10, 5822-5833.
Keene, M. A. & Elgin, S. C. R. (1981). Cell, 27, 57-64.
Keene, M. A. & Elgin, S. C. R. (1984). Cell, 36, 121-129.
Lipman, D. J. & Wilbur, W. J. (1983). J. Mol. Biol. 163, 363-376.
Lomonossoff, G. P., Butler, P. J. G. & Klug, A. (1981). J. Mol. Biol. 149, 745-760.
Lutter, L. C. (1978). J. Mol. Biol. 124, 391-470.
Manning, G. S. (1983). Biopolymers, 22, 689-729.
Maxam, A. M. & Gilbert, W. (1980). Methods Enzymol. 65, 499-560.
Modak, S. P. & Beard, P. (1980). Nucl. Acids Res. 8, 2665-2678.
Ornstein, R. L. & Fresco, J. R. (1983). Biopolymers, 23, 1979-2000.
Patel, D. J., Kozlowski, S. A., Ikuta, S., Itakura, K., Bhatt, R. & Hare, D. R. (1982). Cold Spring Harbor Symp. Quant. Biol. 47, 197-206.
Patel, D. J., Kozlowski, S. A. & Bhatt, R. (1983). Proc. Nat. Acad. Sci., U.S.A. 80, 3908-3912.
Razvi, R., Gargiulo, G. & Worcel, A. (1983). Gene, 23, 175-183.
Saenger, W. (1984). Principles of Nucleic Acids Structure, Springer-Verlag, New York.
Shakked, Z. & Rabinovich, D. (1983). J. Mol. Biol. 166, 183-201.
Shannon, C. (1948a). Bell System Tech. J. 27, 379-423.
Shannon, C. (1948b). Bell System Tech. J. 27, 623-656.
Shepherd, J. C. W. (1981a). J. Mol. Evol. 17, 94-102.
Shepherd, J. C. W. (1981b). Proc. Nat. Acad. Sci., U.S.A. 78, 1596-1600.
Snyder, M., Hunkapiller, M., Yuen, D., Silvert, D., Fristrom, J. & Davidson, N. (1982). Cell, 29, 1027-1040.
Tucker, P. W., Hazen, E. F. & Cotton, F. A. (1978). Mol. Cell Biochem. 22, 67-77.
Tucker, P. W., Hazen, E. F. & Cotton, F. A. (1979). Mol. Cell Biochem. 23, 67-85.
Udvardy, A. & Schedl, P. (1983). J. Mol. Biol. 166, 159-181.
Udvardy, A., Schedl, P., Sander, M. & Hsieh, T. S. (1985). Cell, 40, 933-941.
Wells, R. D. & Blair, J. E. (1967). J. Mol. Biol. 27, 273-288.
Wells, R. D., Larson, J. E., Grant, R. C., Shortle, R. E. & Cantor, C. R. (1970). J. Mol. Biol. 54, 465-497.
Wu, C., Bingham, P. M., Livak, K. J., Holmgren, R. & Elgin, S. C. R. (1979). Cell, 16, 797-806.
Wu, H. M. & Crothers, D. M. (1984). Nature (London), 308, 509-513.
Edited by M. Gottesman