Wavelet analysis of DNA sequences


Genetic Analysis: Biomolecular Engineering 12 (1996) 165-168

ELSEVIER

M. Altaiski (a,*,1), O. Mornev (b), R. Polozov (b)

(a) Centre for Applicable Mathematics, B.M. Birla Science Centre, Adarshnagar, Hyderabad, 500463, India
(b) Institute of Theoretical and Experimental Biophysics, Puschino, Moscow Region, 142292, Russia

Received 19 July 1995; revision received 22 August 1995; accepted 14 September 1995

Abstract

Wavelet decomposition is applied to the analysis of the nucleotide sequence of the rhodopsin gene of Chinese Hamster cells. The Lipschitz-Hölder exponents for the probability measures of the adenine, guanine, thymine and cytosine distributions are obtained. The local scaling found by means of wavelet analysis is argued to be an indication of long-range correlations.

Keywords: DNA; Wavelets; Long-range correlations

1. Introduction

The analysis of nucleotide sequences remains one of the important problems of modern molecular biology. The principal information about living organisms is encoded in very long sequences of nucleotides which consist of four basic elements (4-letter alphabet): adenine (A), guanine (G), thymine (T) and cytosine (C). For this reason the search for regular structures and correlations in polynucleotide chains is a huge statistical problem. Recently, a power law for long-range correlations was found in certain nucleotide sequences [1,2]. The method by which this power law was obtained was not strictly rigorous; for this reason the very existence of long-range correlations is still in question and is being actively discussed [1-14]. The scaling law or, more precisely, the multifractal behavior of the corresponding probability measure, if proved to be valid, could be considered as a blueprint for some pronounced self-similar structures present in nucleotide sequences, similar to those in the physics of fractals [16,17].

In the present paper, we study the scaling properties of the nucleotide sequence of the rhodopsin gene of Chinese Hamster cells [18]. Having constructed four probability measures (for A, T, C and G, respectively), we performed the wavelet decomposition, which has long been established as an excellent tool for singular measure analysis, and found the scaling law. In the absence of strict global scaling, we have found a local one (see Results and Figs. 1-8), in the range of several hundred nucleotides, which could be related to long correlations of the same range. The structure of these correlations is displayed as a branching process. Graphic representation of the wavelet coefficients corresponding to these measures can be used for further analysis.

* Corresponding author.
1 Permanent address: Joint Institute for Nuclear Research, Dubna, 141980, Russia. Fax: +007 09621 66666; e-mail: [email protected].
Copyright © 1996 Published by Elsevier Science B.V. All rights reserved. SSDI 1050-3862(95)00129-A

2. The mathematical framework

In general, the occurrence of a certain nucleotide in a certain position of the DNA chain, labeled by a length parameter $l$, can be described as a random process $X(l, \omega)$. Thus, for the case of the above-mentioned 4-letter alphabet, we deal with a probability space $(\Omega, U, P)$, with $\Omega = \{A, T, C, G\}$, and a family of four random processes $X_z = \{X_z(l, \omega);\ l \in R,\ \omega \in \Omega\}$, such that $X_z(l, \omega) = 1$ if $\omega = z$ and $X_z(l, \omega) = 0$ otherwise.
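The four processes can be illustrated on a short made-up sequence. The sketch below assumes that $X_z(l, \omega)$ takes the value 1 when the nucleotide at position $l$ is $z$ and 0 otherwise; the helper name `indicator_processes` and the toy string are hypothetical and are not taken from the paper or from the rhodopsin data.

```python
ALPHABET = "ATCG"

def indicator_processes(sequence):
    """Return {z: list of 0/1}, with 1 where the nucleotide at that position is z.

    Assumption: X_z is the indicator of nucleotide z at each chain position.
    """
    seq = sequence.upper()
    return {z: [1 if base == z else 0 for base in seq] for z in ALPHABET}

toy = "ATGCGATTAC"  # hypothetical toy sequence, not real gene data
X = indicator_processes(toy)

# Exactly one of the four indicators is 1 at every position.
column_sums = [sum(X[z][i] for z in ALPHABET) for i in range(len(toy))]
```

Under this assumption the four arrays partition the chain, so any statistic computed per nucleotide can be compared across the four letters on an equal footing.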

Instead of calculating correlations, as was done in [1,2], we proceed with the integral measures

$$\mu_z(l) = \int_0^l X_z(l', \omega)\, dP(\omega)\, dl' = \int_0^l d\mu_z, \qquad (1)$$

which count the total number of each of the nucleotides $z = A, T, C, G$ up to the $l$-th position in the chain. Since the measures in equation 1 are supposed to be generally non-differentiable, we first have to study their scaling behavior

$$\mu(x) - \mu(x_0) \sim |x - x_0|^h. \qquad (2)$$

The extraction of the Lipschitz-Hölder exponent $h$ from the experimentally obtained measure is a typical problem in the physics of fractals. One of the most reliable methods to find it is the wavelet transform. Referring the reader to [17] and references therein for details, we only present the basic theorem [15] we made use of and then turn to the results.

Theorem 1

If $\mu(x)$ is a bounded locally integrable function that satisfies

$$\mu(x) - \mu(x_0) = O(|x - x_0|^h), \quad h \in [0,1],$$

at some point $x_0 \in R$, then, provided the analyzing wavelet $g$ satisfies $g \in L^1$, $x^h g \in L^1$ and the zero-mean condition

$$\int g(x)\, dx = 0,$$

its wavelet transform behaves as

$$T_g(a,x)[\mu] = O(a^{h + 1/2})$$

in the cone $|x - x_0| \le \mathrm{const} \cdot a$.

The wavelet transform is defined in a standard way [19], which differs by the constant 1/2 from the notation used in [17]:

$$T_g(a,x)[f] = \int \frac{1}{\sqrt{a}}\, g\!\left(\frac{y - x}{a}\right) f(y)\, dy. \qquad (3)$$

For the analysis of the measures obtained from the rhodopsin gene sequence in question, we use the 'mexican hat'

$$g_2(x) = (1 - x^2) \exp(-x^2/2)$$

as a basic wavelet $g(x)$, see equation 3. This is the second in the sequence of vanishing-moment wavelets

$$g_n(x) = (-1)^{n+1} \frac{d^n}{dx^n} \exp(-x^2/2),$$

which has long been known as an effective tool in fractal analysis [17].

3. Results

In this investigation, we calculated the wavelet coefficients $T_g(a,x)$ on a discrete lattice of 10 scales $a = 2^i$, $i = 0, \ldots, 9$, at 8192 points, which is the maximum binary power below the length of the available sample [18] of 11 838 nucleotides. The logarithmic plots of $\log_2 |T_g(a,x)|$ for the measures $\mu_A$, $\mu_T$, $\mu_C$, $\mu_G$ are presented in Figs. 1-4, respectively. The plots were obtained at the middle of the range, $x_0 = 4096$; however, the behavior of the sections at other points is not seriously different. The corresponding Lipschitz-Hölder exponents are presented in the following table:

$h_A$    $h_T$    $h_C$    $h_G$
0.60     0.43     0.60     0.53

Figs. 1-4. The dependence of the binary logarithms of the $g_2$-wavelet coefficients, $\log_2 |T_g(a)|$ versus $\log_2 a$, for the measure functions taken at the middle of the data for adenine, thymine, cytosine and guanine, respectively. The values of the Lipschitz-Hölder exponents presented in the pictures were obtained with the best line approximation.

These coefficients are conspicuously close to the $h_B = 1/2$ of the Brownian motion, the purely random process. However, the difference $h_z - h_B$, where $z = A, T, C, G$, which has the magnitude of several percent, cannot be regarded as vanishing. This difference can be caused by branching processes which can be clearly seen on the density plots of wavelet coefficients, at scales approximately equal to $2^7$ or $2^8$, see Figs. 5-8. Thus, we conclude that the scaling in DNA chains does really exist. This scaling is of a multifractal nature (see e.g. [18]), rather than a global one.
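The estimation procedure of Sections 2 and 3 can be sketched in a few lines. The paper does not describe its numerical scheme, so everything below is an assumption made for illustration: the function names are hypothetical, the wavelet integral is approximated by a plain Riemann sum with unit step, the 'mexican hat' is truncated at six scale lengths, and the exponent is read off as the least-squares slope of $\log_2 |T_g(a, x_0)|$ against $\log_2 a$ minus 1/2. The synthetic check relies on the fact that for $f(y) = |y - x_0|^h$ a change of variables gives $T_g(a, x_0)[f] = a^{h + 1/2} \int g(u) |u|^h du$.

```python
import math

def mexican_hat(u):
    """g_2(u) = (1 - u^2) exp(-u^2 / 2)."""
    return (1.0 - u * u) * math.exp(-0.5 * u * u)

def cumulative_measure(sequence, z):
    """mu_z(l): number of occurrences of nucleotide z among the first l positions."""
    total, mu = 0, []
    for base in sequence.upper():
        total += 1 if base == z else 0
        mu.append(total)
    return mu

def wavelet_coefficient(f, a, x0, half_width=6):
    """Riemann-sum approximation of T_g(a, x0)[f] with the 1/sqrt(a) normalisation.

    Assumption: the wavelet is truncated at |y - x0| <= half_width * a.
    """
    k_max = int(half_width * a)
    if x0 - k_max < 0 or x0 + k_max >= len(f):
        raise ValueError("wavelet support exceeds the sample at this scale")
    s = sum(mexican_hat(k / a) * f[x0 + k] for k in range(-k_max, k_max + 1))
    return s / math.sqrt(a)

def holder_exponent(f, x0, scales):
    """Least-squares slope of log2|T_g(a, x0)| versus log2 a, minus 1/2.

    Fails with a math domain error if a coefficient is exactly zero.
    """
    xs = [math.log2(a) for a in scales]
    ys = [math.log2(abs(wavelet_coefficient(f, a, x0))) for a in scales]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx - 0.5

# Synthetic check: a pure power-law cusp should return roughly its own exponent.
x0, h_true = 2048, 0.6
f = [abs(y - x0) ** h_true for y in range(4096)]
h_est = holder_exponent(f, x0, scales=[8, 16, 32, 64])
```

For a nucleotide sequence `seq`, the same estimator would be applied to `cumulative_measure(seq, "A")` and its three counterparts. The smallest scales are omitted in the synthetic check because a unit-step Riemann sum resolves the wavelet poorly there; no values are claimed here for real sequence data.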

Figs. 5-8. The grey density plots of wavelet coefficients for adenine, thymine, cytosine and guanine, respectively (scale axis with ticks from 1.0 to 256.0; position axis with ticks at 63.0, 3263.0 and 6463.0). The fragmentation processes are clearly distinguishable at scales between $2^6$ and $2^9$.

As an auxiliary result, we can mention that the color density representation of wavelet coefficients, which proved to be a powerful tool for the analysis of general fractals, seems to be of great use for identifying branching processes in nucleotide chains as a computer graphic tool.

Acknowledgements

Two of the authors, M.A. and O.M., are grateful to Dr. B.G. Sidharth for hospitality at the B.M. Birla Science Centre and financial support. We are also thankful to Dr. V. Sivozhelezov for critical reading of the manuscript.

References

[1] Peng C-K et al. Nature 1992; 356: 168.
[2] Voss RF. Phys Rev Lett 1992; 25: 3805-3808.
[3] Li W. Int J Bifurc Chaos 1992; 2: 137-154.
[4] Li W, Kaneko K. Europhys Lett 1992; 17: 655-660.
[5] Nee S. Nature 1992; 357: 450.
[6] Maddox J. Nature 1992; 358: 103-105.
[7] Prabhu VV, Claverie J-M. Nature 1992; 359: 782.
[8] Li W, Kaneko K. Nature 1992; 360: 635.
[9] Munson PJ et al. Nature 1992; 360: 636.
[10] Amato I. Science 1992; 257: 747.
[11] Buldyrev SV et al. Phys Rev Lett 1993; 71: 1776.
[12] Voss RF. Phys Rev Lett 1993; 71: 1777.
[13] Karlin S, Brendel V. Science 1993; 259: 677-680.
[14] Kapitonov VV, Titov II. Dokl Akad Nauk 1994; 337: 810-812 (in Russian).
[15] Holschneider M, Tchamitchian P. In: Lemarie PG, ed. Les Ondelettes. Berlin: Springer, 1990.
[16] Feder J. Fractals. New York: Plenum Press, 1988.
[17] Muzy JF et al. Phys Rev E 1993; 47: 875.
[18] Gale JM et al. J Mol Biol 1992; 224: 343-358.
[19] Daubechies I. Comm Pure Appl Math 1988; 16: 909-996.