J. theor. Biol. (1986) 118, 295-300
A Measure of DNA Periodicity
B. D. SILVERMAN AND R. LINSKER
IBM Thomas J. Watson Research Center, Yorktown Heights, New York 10598, U.S.A.
(Received 18 March 1985, and in revised form 25 September 1985)
A Fourier transform g(n) of a sequence of bases along a given stretch of DNA is defined. The transform is invariant to the labelling of the bases and can therefore be used as a measure of periodicity for segments of DNA with differing base content. It can also be conveniently used to search for base periodicities within large DNA data bases.
Introduction
The sequence of bases along any given stretch of DNA may hold information concerning potential amino acid sequences of proteins yet to be formed, regulatory signals to be imparted to enzymes intimately involved in the processes of transcription or translation, and other signals basic to functions that are at present not clearly understood. One interesting and ubiquitous feature of eucaryotic DNA is the presence of tandem (periodic) as well as interspersed base sequence repeats throughout the genome. Such repeats vary in length from "simple sequence" dinucleotide periodicities (Smith, 1976; Federoff & Brown, 1978; Sures et al., 1978; Slightom et al., 1980; Cohen et al., 1982; Heller et al., 1982; Gebhard & Zachau, 1983; Jeang & Hayward, 1983; Maroteaux et al., 1983; Tantz & Renz, 1984) to the interspersed longer Alu or Kpn type sequences (Britten & Kohne, 1968; Houck et al., 1979; Cheng et al., 1981; Fox et al., 1982; Singer, 1982; Carroll et al., 1984; Schimenti & Duncan, 1984).
Since the characterization of, and search for, DNA sequence information has proceeded exclusively in the nucleotide domain (Soll & Roberts, 1982), i.e. with respect to specific DNA sequences, we thought that it might be illuminating to determine whether the techniques of Fourier analysis might be of use in the specification and/or examination of a given stretch of DNA. The presence of periodicities suggested that such analyses might be of value. The present paper is therefore devoted to describing how one might "Fourier analyze" a sequence of bases along DNA and to illustrating the use of this procedure with several examples.
0022-5193/86/030295+06 $03.00/0 © 1986 Academic Press Inc. (London) Ltd
Fourier Transform
Let us consider a segment of DNA 2M bases in extent (Fig. 1). We limit ourselves to the search for periodicities within the stretch from -M+1 to M; M is, of course, arbitrary. We do this by considering a discrete Fourier transform of this sequence of bases in a manner to be defined. The problem of assigning values for the base occupation of the different positions is a nontrivial one. We would like to obtain a Fourier transform that is independent of the labelling of the bases, e.g. the transform should yield the same number, indicating high content of, say, periodicity three, whether the base sequence has many simple repeats of the type ggcggcggc... or aagaagaag.... With such a transform one could make comparisons between different segments of DNA with differing base content.†
FIG. 1. 2M bases along a DNA strand.
This property of the transform can be achieved in the following manner. At each base location m we assign a unit vector f(m) that points in one of four directions, from the center to the corners of a regular tetrahedron. Tetrahedra can be conveniently thought of as centered at the various base positions (Fig. 2). The direction of the unit vector at each position is determined by the particular base at that position, i.e. g, c, a, or t.
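The tetrahedral assignment just described can be sketched in a few lines of Python. The particular corner given to each base (the `BASE_VECTORS` table) and the helper names `dot` and `encode` are assumptions made for illustration only; the text requires no more than four unit vectors pointing to the corners of a regular tetrahedron.

```python
import math

# Assumed corner assignment: the paper does not say which base takes which
# corner, and the measure defined later does not depend on that choice.
S = 1.0 / math.sqrt(3.0)
BASE_VECTORS = {
    "g": (S, S, S),
    "c": (S, -S, -S),
    "a": (-S, S, -S),
    "t": (-S, -S, S),
}


def dot(u, v):
    """Inner product of two 3-vectors."""
    return sum(a * b for a, b in zip(u, v))


def encode(sequence):
    """Return the list of unit vectors f(m) for a string of bases."""
    return [BASE_VECTORS[base] for base in sequence.lower()]


if __name__ == "__main__":
    # Print the inner product for every ordered pair of bases.
    for x in "gcat":
        for y in "gcat":
            print(x, y, round(dot(BASE_VECTORS[x], BASE_VECTORS[y]), 6))
```

With this choice every vector has unit length, and any two distinct bases have the same inner product, -1/3, which is what makes the construction symmetric in the four labels.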
FIG. 2. f(m) as a function of base content along a DNA strand.
Each of the vectors f(m) can be resolved along three mutually orthogonal axes given by the unit vectors ξ, η and ν. The Fourier series of the components of f(m) along each of these axes can be written

$$f_\xi(m) = \hat{\xi}\cdot\mathbf{f}(m) = \sum_{n=-M+1}^{M} C_n^{\xi}\, e^{n\pi i m/M}, \quad \text{etc.} \tag{1}$$

with Fourier coefficients, $C_n^{\xi}$, etc., given by

$$C_n^{\xi} = \frac{1}{2M}\sum_{m=-M+1}^{M} f_\xi(m)\, e^{-n\pi i m/M}, \quad \text{etc.} \tag{2}$$
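As a numerical check on the pair of equations (1) and (2), the following Python sketch computes the coefficients of one component and then rebuilds the component from them. The function names `coefficient` and `reconstruct`, and the convention of storing the 2M values in the order m = -M+1, ..., M, are assumptions introduced here rather than anything specified in the text.

```python
import cmath
import math


def coefficient(component, n):
    """Eq. (2): C_n = (1/2M) * sum over m of f(m) * exp(-n*pi*i*m/M).

    `component` holds the 2M real values f(-M+1), ..., f(M) in that order.
    """
    two_m = len(component)
    M = two_m // 2
    total = 0j
    for k, value in enumerate(component):
        m = k - M + 1  # list index k corresponds to position m
        total += value * cmath.exp(-1j * n * math.pi * m / M)
    return total / two_m


def reconstruct(component, m):
    """Eq. (1): f(m) = sum over n = -M+1..M of C_n * exp(n*pi*i*m/M)."""
    M = len(component) // 2
    total = 0j
    for n in range(-M + 1, M + 1):
        total += coefficient(component, n) * cmath.exp(1j * n * math.pi * m / M)
    return total


if __name__ == "__main__":
    values = [0.5, -1.0, 2.0, 0.0, 1.5, -0.25]  # arbitrary test data, 2M = 6
    M = len(values) // 2
    for k, value in enumerate(values):
        print(value, reconstruct(values, k - M + 1))
```

The reconstruction returns the original values up to rounding, with a negligible imaginary part. The direct double loop is chosen for readability; a fast Fourier transform would give the same coefficients more efficiently.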
† This work was initially motivated by a preprint of the paper by C. A. Pickover, Frequency spectra of DNA sequences: application to a human bladder cancer gene, Journal of Molecular Graphics 2, 50 (1984). Pickover Fourier analyzes the base content of the T24 oncogene with use of a three dimensional power spectrum representation similar to that frequently used in the field of speech analysis. He assigns equal values to g, c and to a, t and therefore generates a Fourier transform of varying denaturation potential along a stretch of DNA. Fourier analyses have also been performed on the hydrophilic-hydrophobic variation (and other types of variation) in amino acid character along a polypeptide segment; see, for example, A. D. McLachlan and J. Karn, J. mol. Biol. 164, 605 (1983). The present method, in contrast, treats all symbols (denoting nucleotides or amino acids) on an equal footing, to yield information about symbol-sequence periodicities.
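For each n, the sum of the absolute values squared of the Fourier coefficients, g(n), can be written

\begin{align}
g(n) &= |C_n^{\xi}|^2 + |C_n^{\eta}|^2 + |C_n^{\nu}|^2 \tag{3.1} \\
&= \frac{1}{4M^2}\sum_{m}\sum_{m'} \mathbf{f}(m)\cdot\mathbf{f}(m')\, e^{-n\pi i (m-m')/M} \tag{3.2} \\
&= \frac{1}{4M^2}\sum_{m}\sum_{m'} \cos\theta_{m,m'}\, e^{-n\pi i (m-m')/M} \tag{3.3} \\
&= \frac{1}{4M^2}\sum_{l}\Big[\sum_{m} \mathbf{f}(m)\cdot\mathbf{f}(m-l)\Big]\, e^{-n\pi i l/M} \tag{3.4}
\end{align}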
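The equality of forms (3.1) and (3.3) can be checked numerically. The Python sketch below evaluates g(n) once from the three sets of Fourier coefficients and once from the pairwise cosines, and compares the repeats ggcggc... and aagaag... mentioned earlier. The corner assignment in `BASE_VECTORS` and the function names are illustrative assumptions, and the 12-base test strings are synthetic.

```python
import cmath
import math

# Assumed tetrahedral corner assignment (any assignment gives the same g).
S = 1.0 / math.sqrt(3.0)
BASE_VECTORS = {
    "g": (S, S, S),
    "c": (S, -S, -S),
    "a": (-S, S, -S),
    "t": (-S, -S, S),
}


def g_from_coefficients(seq, n):
    """Eq. (3.1): sum of |C_n|^2 over the three orthogonal components."""
    two_m = len(seq)
    M = two_m // 2
    total = 0.0
    for axis in range(3):
        c = 0j
        for k, base in enumerate(seq):
            m = k - M + 1
            c += BASE_VECTORS[base][axis] * cmath.exp(-1j * n * math.pi * m / M)
        total += abs(c / two_m) ** 2
    return total


def g_from_pairs(seq, n):
    """Eq. (3.3): double sum over pairs, using only 'same base or not'."""
    two_m = len(seq)
    M = two_m // 2
    total = 0j
    for k, x in enumerate(seq):
        for k2, y in enumerate(seq):
            cos_theta = 1.0 if x == y else -1.0 / 3.0
            total += cos_theta * cmath.exp(-1j * n * math.pi * (k - k2) / M)
    return total.real / two_m ** 2


if __name__ == "__main__":
    n = 4  # period 3 in a window of 12 bases
    print(g_from_coefficients("ggc" * 4, n), g_from_pairs("ggc" * 4, n))
    print(g_from_coefficients("aag" * 4, n), g_from_pairs("aag" * 4, n))
```

Because `g_from_pairs` never looks at which base is which, only at whether two positions hold the same base, relabelling the bases cannot change its value. For an exact triplet repeat of this type the period-three component works out to 8/27.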
g(n) is a useful scalar measure of the spectral content of an arbitrary stretch of DNA.† It has several nice features. (1) It is, of course, invariant to the choice of the orthogonal axes ξ, η and ν used to calculate it. (2) It is invariant to permutation of the labelling of the bases since it depends only upon the alignment or nonalignment of every pair of bases, i.e. cos θ_{m,m'} takes on only one of two values in equation (3.3): 1 when the two bases are the same and -1/3 when they differ. (3) It is the Fourier transform of the inner product autocorrelation function of f(m) (equation (3.4)). In the next section we illustrate some results of calculating g(n) for several different DNA sequences.
Applications
Figures 3(a) through 3(f) illustrate the Fourier transform, 4M²g(n), for a stretch of 100 bases in several DNA non-coding regions. Figure 3(a) shows the results of calculation for the stretch of bases centered at position 670 within the 5' flanking region of the T24 human bladder oncogene (Reddy, 1983). The simple sequence repeat ggcggcggc... beginning at position 680 and ending at position 1000 yields the peak at value 33 on the abscissa. Other triplet periodic repeats will contribute at or near this value of the abscissa. It is of interest that the normal cellular homolog of this oncogene was found with a deletion in the vicinity of this simple sequence repeat. This is consistent with the general observation that simple sequences are hot spots for recombinational events. It is also of interest that the mutational hot spot at the codon for the 12th amino acid of the normal protein is part of the unique coding sequence ggcgg, which resembles part of the simple sequence repeat beginning at position 680.
Figure 3(b) shows the transform farther downstream for 100 bases centered at position 1550, also in the 5' flanking region of the T24 human bladder oncogene (Reddy, 1983). There are two pronounced peaks at 17 and 19. They reflect the presence of several five and six base repeats involving either three or four contiguous guanine bases. Figures 3(c) and 3(d) exhibit peaks at 49 and 50, expressing the significant number of dinucleotide simple sequence tandem repeats TGTGTG... of the human somatostatin I gene (Shen & Rutter, 1984) and the proviral SSV genome (Devare et al., 1983). It should be emphasized again that other dinucleotide tandem repeats will also contribute to this peak.
† For the use of the autocorrelation function (rather than its Fourier transform) to detect regularities in DNA sequences, see for example E. N. Trifonov and J. L. Sussman, PNAS 77, 3816 (1980). Evaluating instead the Fourier transform of the autocorrelation function directly gives information as to the relative oscillation amplitudes for various periodicities, including the 10.5-base periodicity emphasized in the reference.
FIG. 3. Fourier transform, 4M²g(n), vs reciprocal period × 100. (a) T24 Human Bladder Oncogene, SC (strand center) = 670. (b) T24 Human Bladder Oncogene, SC = 1550. (c) Human Somatostatin I Gene, SC = 836. (d) Proviral SSV Genome, SC = 160. (e) X. laevis oocyte 5S DNA, SC = 60. (f) X. laevis oocyte 5S DNA, SC = 240.
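To see why a triplet repeat gives a peak at 33 and a dinucleotide repeat a peak at 50 in a 100-base window, the Python sketch below computes 4M²g(n) for n = 1, ..., M on two synthetic, perfectly repeating sequences. These are not the T24, somatostatin or SSV sequences discussed above; the corner assignment in `BASE_VECTORS` and the names `spectrum` and `peak` are likewise assumptions for illustration. With 2M = 100, the index n is the reciprocal period × 100 plotted on the abscissa.

```python
import cmath
import math

# Assumed tetrahedral corner assignment (any assignment gives the same g).
S = 1.0 / math.sqrt(3.0)
BASE_VECTORS = {
    "g": (S, S, S),
    "c": (S, -S, -S),
    "a": (-S, S, -S),
    "t": (-S, -S, S),
}


def spectrum(seq):
    """Return {n: 4*M^2*g(n)} for n = 1..M, for a base string of length 2M."""
    two_m = len(seq)
    M = two_m // 2
    result = {}
    for n in range(1, M + 1):
        power = 0.0
        for axis in range(3):
            s = 0j
            for k, base in enumerate(seq.lower()):
                m = k - M + 1
                s += BASE_VECTORS[base][axis] * cmath.exp(-1j * n * math.pi * m / M)
            power += abs(s) ** 2  # equals 4*M^2*|C_n|^2 for this axis
        result[n] = power
    return result


def peak(spec):
    """Index n at which the spectrum is largest."""
    return max(spec, key=spec.get)


if __name__ == "__main__":
    triplet = ("ggc" * 34)[:100]  # synthetic triplet repeat, 100 bases
    doublet = "tg" * 50           # synthetic dinucleotide repeat, 100 bases
    print(peak(spectrum(triplet)), peak(spectrum(doublet)))
```

The triplet repeat peaks at n = 33 rather than at 100/3 because only integer n are sampled. Real sequences, with imperfect repeats, give lower and broader peaks than these idealized ones.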
FIG. 4. Fourier transform, 4M²g(n), vs reciprocal period × 100. Coding region: 4th exon of the T24 Human Bladder Oncogene, SC = 3300.
Figures 3(e) and 3(f) show peaks in the Fourier transform associated with the longer eight base pair simple sequence repeat at two different positions in the A-T rich spacer region of X. laevis oocyte 5S DNA (Federoff & Brown, 1978). Variation in the redundancy of this oligonucleotide accounts for much of the repeat length variation in the genomic 5S DNA of X. laevis. Figure 4 shows the transform for a stretch of 100 base pairs within the 4th exon of the T24 human bladder carcinoma oncogene. There is an enhanced Fourier component of period three; visual examination of this stretch of DNA (Fig. 5), however, does not immediately reveal the origin of this periodicity. Closer examination shows the periodicity to be related to the recurrence of guanine in the third position of a number of contiguous codons.
 CC TTC TAC ACG TTG GTG CGT GAG ATC CGG CAG CAC AAG CTG CGG AAG CTG
 la Phe Tyr Thr Leu Val Arg Glu Ile Arg Gln His Lys Leu Arg Lys Leu
AAC CCT CCT GAT GAG AGT GGC CCC GGC TGC ATG AGC TGC AAG TGT GTG CT
Asn Pro Pro Asp Glu Ser Gly Pro Gly Cys Met Ser Cys Lys Cys Val Le
FIG. 5. Sequence of 100 bases centered at position 3300. Dashes have been placed over guanines in the third codon position.
Finally, Fig. 6 shows the calculated transform when the 128 base pair segment sampled just straddles the 64 base pair tandem repeat in the early promoter region of SV40 deletion mutants (Benoist & Chambon, 1981). One gets a spectrum where every alternate Fourier component has enhanced spectral content. One generally obtains the most dramatic and enhanced Fourier components, of course, when the number of tandem repeats just fits within the window of the number of bases being examined.
FIG. 6. Fourier transform, 4M²g(n), vs reciprocal period × 128. SV40 Deletion Mutant, SC = 124.
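The alternating pattern in the SV40 spectrum follows from the window holding exactly two copies of the repeat: shifting by 64 bases multiplies the term for component n by exp(-nπi), which is -1 for odd n, so the two copies cancel for every odd n. The Python sketch below illustrates this with a hypothetical 64-base unit generated by an arbitrary rule. It is not the SV40 sequence, and `BASE_VECTORS`, `UNIT` and `g` are names assumed here for illustration.

```python
import cmath
import math

# Assumed tetrahedral corner assignment (any assignment gives the same g).
S = 1.0 / math.sqrt(3.0)
BASE_VECTORS = {
    "g": (S, S, S),
    "c": (S, -S, -S),
    "a": (-S, S, -S),
    "t": (-S, -S, S),
}


def g(seq, n):
    """g(n) of eq. (3.1) for a base string of even length 2M."""
    two_m = len(seq)
    M = two_m // 2
    total = 0.0
    for axis in range(3):
        c = 0j
        for k, base in enumerate(seq):
            m = k - M + 1
            c += BASE_VECTORS[base][axis] * cmath.exp(-1j * n * math.pi * m / M)
        total += abs(c / two_m) ** 2
    return total


# Hypothetical 64-base unit built from an arbitrary deterministic rule.
UNIT = "".join("gcat"[(7 * i * i + i // 3) % 4] for i in range(64))
WINDOW = UNIT * 2  # two tandem copies exactly fill the 128-base window

odd = [g(WINDOW, n) for n in range(1, 64, 2)]
even = [g(WINDOW, n) for n in range(2, 65, 2)]

if __name__ == "__main__":
    print("largest odd component: ", max(odd))
    print("sum of even components:", sum(even))
```

The odd components vanish to rounding error whatever 64-base unit is used, so all of the spectral content sits in the even components. If the window did not line up with a whole number of repeats, the odd components would no longer cancel exactly.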
Conclusions
A Fourier transform g(n) of the sequence of bases along a stretch of DNA has been defined. The transform is useful in at least two different ways. First, it should
provide a quantitative and comparative measure of DNA periodicity even for segments of DNA having significantly different base content. Second, the function can be used to search for all pronounced periodicities within one sequence with a single calculation, which is the power of the Fourier transform. This will, perhaps, become of even greater significance in the future as the DNA data base grows rapidly in size.
REFERENCES
BENOIST, C. & CHAMBON, P. (1981). Nature 290, 304.
BRITTEN, R. J. & KOHNE, D. E. (1968). Science 161, 529.
CARROLL, D., GARRETT, J. E. & LAM, B. S. (1984). Mol. Cell. Biol. 4, 254.
CHENG, J., PRINTZ, R., CALLAGHAN, T., SHUEY, D. & HARDISON, R. C. (1981). J. mol. Biol. 176, 1.
COHEN, J. B. et al. (1982). Nucleic Acids Res. 10, 3353.
DEVARE, S. G., REDDY, E. P., LAW, J. D., ROBBINS, K. C. & AARONSON, S. A. (1983). Proc. natn. Acad. Sci. U.S.A. 80, 731.
FEDEROFF, N. V. & BROWN, D. D. (1978). Cell 13, 701.
FOX, G. M., HESS, J. F., SHEN, C-K. J. & SCHMID, C. W. (1982). Cold Spring Harbor Symp. 47, 1131.
GEBHARD, W. & ZACHAU, H. G. (1983). J. mol. Biol. 170, 567.
HELLER, M., VAN SANTEN, V. & KIEFF, E. (1982). J. Virol. 44, 311.
HOUCK, C. M., RINEHART, F. P. & SCHMID, C. W. (1979). J. mol. Biol. 132, 289.
JEANG, K. & HAYWARD, G. S. (1983). Mol. Cell. Biol. 3, 1389.
MAROTEAUX, L., HEILIG, R., DUPRET, D. & MANDEL, J. L. (1983). Nucleic Acids Res. 11, 1227.
REDDY, E. P. (1983). Science 220, 1061.
SCHIMENTI, J. C. & DUNCAN, D. H. (1984). Nucleic Acids Res. 12, 1641.
SHEN, L. & RUTTER, W. J. (1984). Science 234, 168.
SINGER, M. F. (1982). Cell 28, 433.
SLIGHTOM, J. L., BLECHL, A. E. & SMITHIES, O. (1980). Cell 21, 627.
SMITH, G. P. (1976). Science 191, 528.
SOLL, D. & ROBERTS, R. J. (eds) (1982). The Applications of Computers to Research on Nucleic Acids. Washington: IRL Press.
SURES, I., LOWRY, J. & KEDES, L. H. (1978). Cell 15, 1033.
TANTZ, D. & RENZ, M. (1984). J. mol. Biol. 172, 229.