Statistical properties of nucleotides in human chromosomes 21 and 22


Chaos, Solitons and Fractals 23 (2005) 1077–1085 www.elsevier.com/locate/chaos

Linxi Zhang a,*, Tingting Sun a,b

a Department of Physics, Wenzhou Normal College, Wenzhou 325027, PR China
b Department of Physics, Zhejiang University, Hangzhou 310027, PR China

Accepted 8 June 2004

Abstract
In this paper the statistical properties of nucleotides in human chromosomes 21 and 22 are investigated using n-tuple Zipf analysis with n = 3, 4, 5, 6, and 7. The most common n-tuples are those that consist only of adenine (A) and thymine (T), and the rarest n-tuples are those in which the GC or CG pattern appears twice. As the n-tuples become more frequent, the double GC or CG pattern gives way to a single GC or CG pattern. The percentages of the four nucleotides in the rarest ten and the most common ten n-tuples are also considered, and the two chromosomes behave differently in this respect. The frequency of appearance f(r) of an n-tuple is examined as a function of its rank r: the n-tuple Zipf plot shows power-law behavior for $r < 4^{n-1}$ and a rapid decrease for $r > 4^{n-1}$. To explore the interior statistical properties of the chromosomes in detail, we divide each chromosome sequence into moving windows and discuss the percentage of each ξη pair (ξ, η = A, C, G, T) in those windows. In some particular regions the percentages of ξη pairs change markedly, which may indicate functional differences. The normalized number of repeats $N_0(l)$ can be described by a power law, $N_0(l) \sim l^{-\mu}$. The distance distributions $P_0(S)$ between two nucleotides are also discussed. They are well fitted by a second-order polynomial, $\log P_0(S) = a + bS + cS^2$, which is quite different from the behavior of a random sequence.
© 2004 Elsevier Ltd. All rights reserved.

* Corresponding author. Tel.: +86 571 879 53261; fax: +86 571 879 51328. E-mail address: [email protected] (L. Zhang).
0960-0779/$ - see front matter © 2004 Elsevier Ltd. All rights reserved. doi:10.1016/j.chaos.2004.06.022

1. Introduction
In the last few years, numerous DNA sequences originating from a large number of organisms have been produced by various genome projects. The amount of information grows every day, making it ever more difficult for a researcher to handle the data and extract valuable information from it. All of the data need to be processed and the biological information extracted, with the ultimate goal of understanding the functionality of DNA sequences and using it to the advantage of biology and medicine. Many computational tools, such as BLAST [1,2], FASTA [3,4], and CLUSTAL [5,6], have been developed to compare different data, match patterns between different organisms or within the same organism, or perform multiple alignment and word matching. These methods are usually formulated for chromosomal regions with already known functionality.

Besides these, in recent years the number of studies of the intriguing statistical behavior of DNA sequences has increased, and the statistics of DNA sequences is now an active topic of research. The most common methods include the power spectral density [7–11], the correlation function [8,9,12–15], the walker representation [16–19], the mutual information function [20,21], Zipf analysis [22], n-tuple Zipf analysis [23], and the distribution of SSR, such as DTR [24–26].

In this paper, we explore the statistical properties of nucleotides in human chromosomes 21 and 22. We take advantage of the recently completed and annotated DNA sequences of human chromosome 21 (C21, 33924807 bp) and 22 (C22, 34352073 bp) [27,28]. We mainly focus on the statistical properties of simple sequence repeats (SSR). An SSR can be represented as $(X_1 X_2 \cdots X_N)_l$, where each $X_i$ ($i = 1, 2, \ldots, N$) is one of A, C, G, or T, and $l$ is the number of repeat units, which can, in some cases, be of the order of 100. SSR are of considerable practical and theoretical interest due to their high polymorphism [29].

The paper is organized as follows. Section 2 describes the method of n-tuple Zipf analysis and its application to human chromosomes 21 and 22, where n varies from 3 to 7. In Section 3, we divide the DNA sequence into many windows and discuss the percentage of each ξη pair (ξ, η = A, C, G, T) in moving windows for human chromosomes 21 and 22. The distribution of dimeric tandem repeats (DTR) is presented in Section 4. In Section 5, we introduce a new quantity, the distance distribution $P_0(S)$, and discuss how $P_0(S)$ depends on the distance S in human chromosomes 21 and 22. We also compare the results with those for a random sequence.

2. n-tuple Zipf analysis
Zipf analysis [22] and n-tuple Zipf analysis [23] are two techniques that provide information on coherence. They were initially applied to signals or "text". In conventional Zipf analysis, the frequency of occurrence of each word in a given text is measured by counting the number of occurrences of that word throughout the text and dividing this value by the total number of words. In the study of the complexity of symbolic sequences, however, the usual approach is to investigate the statistical properties of the substrings of length n obtained by progressively shifting a window of n characters over the symbolic text. In this section, we use n-tuple Zipf analysis to investigate the statistical properties of 3-, 4-, 5-, 6-, and 7-tuples in human chromosomes 21 and 22.

The first test concerns the frequency of occurrence of the 3-, 4-, 5-, 6-, and 7-tuples; the results are given in Table 1. Here we analyze the 5-tuples in detail; the others follow the same basic rule. The most common 5-tuples are those that consist only of adenine (A) and thymine (T). Specifically, AAAAA and TTTTT are by far the most common 5-tuples in both chromosomes. The reason may be the known extensive A- and T-rich areas in the non-coding parts of human DNA. The next most common 5-tuples differ between the two chromosomes: in chromosome 21 they are combinations of adenine and thymine, whereas in chromosome 22 they consist mostly of cytosine (C) and guanine (G). The rarest 5-tuples are those in which the pattern GC or CG appears twice. CGTCG, CGACG, and TCGCG are the three least common 5-tuples in both chromosomes. GC-rich areas are often associated with promoters. As the 5-tuples become more frequent, the double GC or CG pattern gives way to a single GC or CG pattern.
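The window-shifting count described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `count_ntuples` and `rarest_and_most_common`, the toy sequence, and the decision to skip windows containing characters other than A, C, G, T are choices made here and are not taken from the paper.

```python
from collections import Counter

VALID_BASES = frozenset("ACGT")


def count_ntuples(sequence: str, n: int) -> Counter:
    """Count overlapping n-tuples by shifting a window of n characters one base at a time."""
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - n + 1):
        word = seq[i:i + n]
        # Assumption: windows containing anything other than A, C, G, T (e.g. N) are skipped.
        if VALID_BASES.issuperset(word):
            counts[word] += 1
    return counts


def rarest_and_most_common(counts: Counter, k: int = 10):
    """Return the k rarest and the k most common n-tuples, both in ascending order of count."""
    ascending = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    return ascending[:k], ascending[-k:]


if __name__ == "__main__":
    toy = "AAAAATTTTTCGCGAAAAATTTTT"  # hypothetical toy sequence, not real chromosome data
    rare, common = rarest_and_most_common(count_ntuples(toy, 5), k=3)
    print("rarest:", rare)
    print("most common:", common)
```

Applied to a full chromosome sequence with n = 3, ..., 7, the same two functions would produce lists of the kind summarized in Table 1; ties between equally frequent n-tuples are broken alphabetically here, which is an arbitrary choice.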
As shown in Table 1, the 3-, 4-, 6-, and 7-tuples follow a similar pattern.

The percentages of the four nucleotides (A, C, G, T) in the rarest ten and the most common ten 3-, 4-, 5-, 6-, and 7-tuples are shown in Fig. 1. Fig. 1(a) shows that in human chromosome 21 the percentages of A and T in the most common ten 3-, 4-, 5-, 6-, and 7-tuples are each close to 50%, so the percentages of C and G are very small. In the rarest ten 3-, 4-, 5-, 6-, and 7-tuples, on the other hand, the percentages of C and G are much larger than those of A and T. These conclusions agree with Table 1. Fig. 1(b) shows chromosome 22, which behaves somewhat differently. For the rarest ten 3-, 4-, 5-, 6-, and 7-tuples, the results are almost the same as for chromosome 21. In the most common ten 3-, 4-, and 7-tuples, however, the percentages of C and G are larger than those of A and T, while in the most common ten 5- and 6-tuples the percentages of A and T are larger than those of C and G. This differs from chromosome 21.

We perform n-tuple Zipf analysis for n = 3 to n = 7. Let f(r) denote the frequency of appearance of an n-tuple. Sorting the n-tuples by frequency assigns a rank r to each n-tuple, with r = 1 for the most frequent one. The dependence of f(r) on rank r for chromosomes 21 and 22 is shown in Fig. 2, where panels (a), (b), (c), (d), and (e) correspond to n = 3, 4, 5, 6, and 7, respectively. The n-tuple Zipf plot shows power-law behavior for $r < 4^{n-1}$ and a rapid decrease of the frequency for $r > 4^{n-1}$. The same behavior has been reported for other DNA sequences in Ref. [23]. The trends of f(r) vs. rank r for chromosomes 21 and 22 are almost the same, but there are differences in some regions. In the region of small r, the frequencies of n-tuples in chromosome 21 are larger than those in chromosome 22, while in the rapidly decreasing tail at large r the frequencies in chromosome 21 are smaller.

Table 1
The various 3-, 4-, 5-, 6-, and 7-tuples in ascending order of appearance for human chromosomes 21 and 22: the ten rarest (top) and the ten most common (bottom) of each length

Human chromosome 21
3-tuples  4-tuples  5-tuples  6-tuples  7-tuples
TGA       CGCG      CGTCG     TACGCG    CGGTACG
CGA       GTCG      CGACG     CGTCGA    TCGTCGA
CGC       CGAC      TCGCG     CGCGTA    GTACGCG
GCG       TACG      CGCGA     TCGACG    TACGCGC
CGT       CGTA      CGCGT     CGTACG    CGAACCG
ACG       ATCG      ACGCG     TCGTCG    TACGCGA
GCC       TCGC      GTACG     CGACGA    TTACGCG
CCG       TCGA      CGTAC     CGCGAA    CGCGTAA
GTC       CGAT      TCGAC     CGATCG    CGTAACG
GAC       GCGA      GTCGA     TTCGCG    CGTACGA
...       ...       ...       ...       ...
ACA       AATA      TTTCT     TTATTT    AAAAAAT
ATA       AATT      AGAAA     TTTTTA    AAAGAAA
TTA       TTCT      TTTTA     AAATAA    TTTTAAA
TAA       AGAA      TATTT     TAAAAA    TTTAAAA
TCT       TTTA      TAAAA     TATTTT    TTTATTT
AGA       TAAA      AAATA     AAAATA    TATTTTT
ATT       ATTT      ATTTT     ATTTTT    AAAAATA
AAT       AAAT      AAAAT     AAAAAT    AAATAAA
TTT       TTTT      TTTTT     TTTTTT    AAAAAAA
AAA       AAAA      AAAAA     AAAAAA    TTTTTTT

Human chromosome 22
3-tuples  4-tuples  5-tuples  6-tuples  7-tuples
TCG       TACG      CGTCG     TACGCG    CGGTCGA
CGA       CGTA      CGACG     CGTACG    CGTCGTA
CGT       CGCG      TCGCG     CGTCGA    TCGCGTA
ACG       GTCG      ACGCG     TCGACG    CGCGTAT
CGC       CGAC      CGCGT     CGCGTA    CGACGTA
GCG       TTCG      TATCG     CGTTCG    TATCGCG
CGG       ATCG      CGATA     CGATCG    TCGATCG
CCG       CGAT      CGCGA     TTCGCG    CGTACGA
GTA       CGTT      GTACG     CGCGAA    TACGCGA
TAC       TCGT      CGTAC     CGAACG    ATTCGCG
...       ...       ...       ...       ...
GAG       GGAG      ATTTT     GCTGGG    TGGGAGG
CTC       CTCC      GGAGG     CCCAGC    CAGCCTG
AGG       TGGG      AAAAT     GGGAGG    CCTCCCA
TGG       CCCA      CCTCC     CCTCCC    GCCTCCC
CCT       CTGG      CCTGG     ATTTTT    GGGAGGC
CCA       CCAG      CCAGG     CAGCCT    CAGGCTG
CTG       CAGG      CTGGG     AAAAAT    CCAGCCT
CAG       CCTG      CCCAG     AGGCTG    AGGCTGG
TTT       TTTT      TTTTT     TTTTTT    TTTTTTT
AAA       AAAA      AAAAA     AAAAAA    AAAAAAA
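The ranking step and the nucleotide percentages of the most common n-tuples can be illustrated as follows. This is a minimal sketch, not the authors' code: the names `zipf_ranking`, `base_percentages`, and `loglog_slope` are hypothetical, the slope is a plain least-squares fit of log10 f(r) against log10 r over a caller-chosen rank range, and the only data used are the ten most common 5-tuples of chromosome 21 as listed in Table 1.

```python
import math
from collections import Counter


def zipf_ranking(counts):
    """Return [(rank, ntuple, count)], with rank 1 for the most frequent n-tuple."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, count) for rank, (word, count) in enumerate(ordered, start=1)]


def base_percentages(ntuples):
    """Percentage of A, C, G and T among all letters of the given n-tuples."""
    letters = Counter("".join(ntuples))
    total = sum(letters.values())
    return {base: 100.0 * letters[base] / total for base in "ACGT"}


def loglog_slope(ranked, r_max):
    """Least-squares slope of log10 f(r) versus log10 r for ranks 1..r_max."""
    pts = [(math.log10(r), math.log10(c)) for r, _, c in ranked if r <= r_max and c > 0]
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy / sxx


# Ten most common 5-tuples of human chromosome 21, copied from Table 1.
CHR21_TOP10_5TUPLES = ["TTTCT", "AGAAA", "TTTTA", "TATTT", "TAAAA",
                       "AAATA", "ATTTT", "AAAAT", "TTTTT", "AAAAA"]

if __name__ == "__main__":
    print(base_percentages(CHR21_TOP10_5TUPLES))
```

For the Table 1 entries above, A and T each account for 24 of the 50 letters (48%) and C and G for one letter each (2%), in line with the statement that the A and T percentages are each close to 50% in chromosome 21. The rank range passed to `loglog_slope` has to be chosen by the user; the code does not detect where the power-law region ends.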

Fig. 1. The percentages of the four nucleotides (A, C, G, T) in the rarest ten (solid symbols) and the most common ten (hollow symbols) 3-, 4-, 5-, 6-, and 7-tuples for (a) human chromosome 21 and (b) human chromosome 22.

3. Percentage of ξη pair in moving window
Although many researchers have carried out statistical analyses of human chromosomes 21 and 22, they have generally worked over the whole sequence. In this section, we investigate the interior statistical features of different parts of the chromosomes. We therefore divide each chromosome sequence into moving windows of length 200 kbp: 170 windows for chromosome 21 and 172 windows for chromosome 22, the last window in each case being shorter than 200 kbp. We then calculate the percentage of each ξη pair in each window, where ξ and η can be A, C, G, or T. The results are given in Fig. 3. For clarity, the plots for the 16 pairs are separated by shifting them by successive factors of 3 on the ordinate.

In Fig. 3(a), three positions are marked where the percentages of the different ξη pairs change noticeably. The first lies between 0.8 Mbp and 1.0 Mbp (signal A), the second between 4.4 Mbp and 4.6 Mbp (signal B), and the third between 28.2 Mbp and 32.6 Mbp (signal C). At these three positions the percentages of some ξη pairs, such as GC, CG, CC, and GG, are larger, while the percentages of others, such as AA, AT, TA, and TT, are smaller.

Human chromosome 22, shown in Fig. 3(b), is more complex than chromosome 21 and has more such positions. The most obvious ones are marked as signals A, B, C, D, E, F, and G in Fig. 3(b): A lies between 400 kbp and 600 kbp, B near 3.8–4.0 Mbp, C at about 5.4–6.4 Mbp, D near 12.0–12.2 Mbp, E at 19.4–19.6 Mbp, F at 29.6–29.8 Mbp, and G at 33.8–34.0 Mbp. At these positions, as Fig. 3(b) shows, the percentages of some ξη pairs become larger while those of other pairs become smaller. These regions may therefore have particular functional features. This method of analysis can provide some insight into human chromosomes 21 and 22.
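The windowed dinucleotide percentages of this section could be computed along the following lines. The sketch assumes consecutive, non-overlapping windows (which reproduces the stated counts of 170 and 172 windows for a 200 kbp window), counts overlapping pairs inside each window, and ignores pairs that straddle a window boundary or contain a character other than A, C, G, T. The function names are hypothetical.

```python
from collections import Counter
from itertools import product

# The 16 possible nucleotide pairs, in a fixed order.
PAIRS = ["".join(p) for p in product("ACGT", repeat=2)]


def window_count(length: int, window_length: int) -> int:
    """Number of consecutive windows needed to cover a sequence; the last may be shorter."""
    return len(range(0, length, window_length))


def pair_percentages(window: str) -> dict:
    """Percentage of each of the 16 pairs among all valid overlapping pairs in one window."""
    counts = Counter(window[i:i + 2] for i in range(len(window) - 1))
    total = sum(counts[p] for p in PAIRS)  # pairs containing e.g. N are left out
    return {p: (100.0 * counts[p] / total if total else 0.0) for p in PAIRS}


def moving_window_percentages(sequence: str, window_length: int = 200_000) -> list:
    """Pair percentages for each consecutive window of the sequence."""
    seq = sequence.upper()
    return [pair_percentages(seq[start:start + window_length])
            for start in range(0, len(seq), window_length)]


if __name__ == "__main__":
    print(window_count(33924807, 200_000))  # chromosome 21 length from the text
    print(window_count(34352073, 200_000))  # chromosome 22 length from the text
    profile = moving_window_percentages("ACGTACGTAC", window_length=4)  # toy input
    print([round(w["AC"], 1) for w in profile])
```

Plotting one entry of each returned dictionary against the window start position gives a per-pair profile along the chromosome; windows where the GC, CG, CC, and GG values rise while AA, AT, TA, and TT fall correspond to the kind of signal discussed in the text.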

Fig. 2. Frequency of appearance f(r) as a function of rank r of (a) 3-, (b) 4-, (c) 5-, (d) 6-, and (e) 7-tuples. (Axes: rank r vs. f(r), both logarithmic; each panel shows chromosome 21 and chromosome 22.)

Fig. 3. Percentage of ξη pair in moving window (%) for human chromosomes 21 (a) and 22 (b). Here the length of the moving window is 200 kbp. (Axes: sequence position [Mbp] vs. percentage of ξη pair in moving window; signals A–C are marked in (a) and signals A–G in (b).)

4. DTR distributions
DTR stands for dimeric tandem repeats, the simplest case of SSR (simple sequence repeats). In this section, we analyze the distributions of DTR in human chromosomes 21 and 22. The DTR method is widely used to analyze DNA sequences [24–26]. First we calculate the number of occurrences N(l) of dimeric tandem repeats with l repetitions for the 16 types of dimers. We combine the results into six groups of DTR: (1) AA, TT; (2) TA, AT; (3) CA, AC, TG, GT; (4) CC, GG; (5) GA, AG, TC, CT; and (6) GC, CG. We use this classification because A is complementary to T and C is complementary to G, and we average over the two possible directions of reading the DNA sequence. In addition, we combine the data for repeats xy and yx, where x and y denote nucleotides A, C, G, or T, since the two have almost identical distributions: a repeat $(xy)_l$ becomes $(yx)_{l \pm 1}$ if the reading frame is shifted by 1 bp. We then calculate the normalized number of repeats of length l, $N_0(l) = N(l)/N(1)$, where N(1) is the total number of occurrences of a single dimer.

Fig. 4 gives the plots of the average normalized number of repeats for the six groups of dimeric tandem repeats for human chromosomes 21 and 22 in double-logarithmic scale. $N_0(l)$ shows a good linear correlation with l on the double-logarithmic scale, which means that the distributions can be described by a power law, $N_0(l) \sim l^{-\mu}$. This agrees with the DTR results for noncoding regions of DNA sequences [24]. In human chromosome sequences, coding regions take up only 5% of the sequence while noncoding regions take up 95%, so it is not surprising that the DTR results for chromosomes 21 and 22 are similar to those for noncoding regions. The DTR plots for chromosomes 21 and 22 nearly coincide when l is small; as l becomes larger, the difference between the two chromosomes becomes clear. The values of μ for the six groups of repeats are 3.4, 3.0, 2.7, 3.7, 6.5, and 5.9, respectively, in chromosome 21, and 3.8, 3.2, 2.9, 3.9, 6.2, and 6.2, respectively, in chromosome 22.
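A possible way to count dimeric tandem repeats and estimate the power-law exponent is sketched below. Several choices here are assumptions rather than the paper's procedure: runs are found greedily from left to right as maximal non-overlapping repeats of the dimer (via a regular expression), the grouping into six classes is left out, and the exponent (called `mu` in the code) is obtained from an unweighted least-squares line through the points (log10 l, log10 N0). All function names are hypothetical.

```python
import math
import re
from collections import Counter


def dtr_counts(sequence: str, dimer: str) -> Counter:
    """N(l): number of maximal runs in which `dimer` is repeated exactly l times in tandem."""
    pattern = re.compile(f"(?:{re.escape(dimer)})+")
    runs = Counter()
    for match in pattern.finditer(sequence.upper()):
        runs[len(match.group()) // 2] += 1  # run length in repeat units
    return runs


def normalized_counts(runs: Counter) -> dict:
    """N0(l) = N(l) / N(1); empty if no single-dimer occurrence was found."""
    n1 = runs.get(1, 0)
    if n1 == 0:
        return {}
    return {l: runs[l] / n1 for l in sorted(runs)}


def fit_mu(n0: dict) -> float:
    """Estimate mu in N0(l) ~ l**(-mu) from a straight-line fit in log-log coordinates."""
    pts = [(math.log10(l), math.log10(v)) for l, v in n0.items() if v > 0]
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return -sxy / sxx


if __name__ == "__main__":
    toy = "CACACATTCATT"  # hypothetical toy sequence: one (CA)3 run and one (CA)1 run
    runs = dtr_counts(toy, "CA")
    print(dict(runs), normalized_counts(runs))
```

Because a repeat of xy turns into a repeat of yx when the reading frame shifts by one base, counting both dimers of a pair and pooling the results, as the text describes, would be a natural extension of `dtr_counts`.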

Fig. 4. The combined plots of the average normalized number of repeats $N_0(l)$ for six groups of dimeric tandem repeats (AA, TT; AT, TA; CA, AC, TG, GT; GA, AG, CT, TC; CC, GG; CG, GC) for human chromosomes 21 and 22 in double-logarithmic scale. The hollow symbols represent human chromosome 21 and the solid symbols represent human chromosome 22. For clarity, the plots for the six groups are separated by shifting them by a factor of $10^2$ on the ordinate.

5. Distance distribution $P_0(S)$
Here we analyze the distance distribution $P_0(S)$ as a function of the distance S between two nucleotides in human chromosomes 21 and 22. We first define the distance S. If there are no other nucleotides between the two nucleotides, the distance between them is 1; if there is one nucleotide between them, the distance is 2. In general, if the number of nucleotides between the two nucleotides is N, the distance between them is N + 1. For example, in the sequence ...ATCAGTG..., the distance between the two A nucleotides is S = 3, while S = 1 for T–C and S = 2 for G–G. We then define the distance distribution as $P_0(S) = N(S)/N(1)$, where N(S) is the number of pairs of nucleotides whose distance is S and N(1) is the number of pairs whose distance is 1.

The distance distributions $P_0(S)$ as a function of distance S for A–A, C–G, and C–C in human chromosomes 21 and 22 are given in Fig. 5(a). All of the plots are well described by second-order polynomial fits; for example, Fig. 5(b) shows the fit for A–A in human chromosome 22. The relation between $P_0(S)$ and S for human chromosomes 21 and 22 is therefore $\log P_0(S) = a + bS + cS^2$. For example, the values of a, b, and c for A–A are 0.286, −0.121, and 0.000636, respectively, in human chromosome 21, and −0.175, −0.114, and 0.000564, respectively, in human chromosome 22. The values of a, b, and c for C–G are 0.529, −0.138, and 0.000902, respectively, in chromosome 21, and 0.839, −0.129, and 0.000837, respectively, in chromosome 22. The distance distributions $P_0(S)$ for the other nucleotide pairs behave in the same way as those for A–A, C–G, and C–C.

We also generate a random sequence composed of A, C, G, and T with a length of 33924807 bp, and calculate the distance distribution $P_0(S)$ as a function of distance S for C–G. The result is also given in Fig. 5(a). The plot is well fitted by a straight line on the semi-logarithmic scale; that is, $P_0(S)$ follows an exponential decay, $\log P_0(S) = 0.205 - 0.293S$. The distance distributions of human chromosomes 21 and 22 are thus quite different from that of a random DNA sequence.

Fig. 5. (a) Distance distribution $P_0(S)$ as a function of distance S for A–A, C–G, and C–C of human chromosomes 21 and 22, together with $P_0(S)$ as a function of S for C–G of a random sequence (33924807 bp). Here S is the distance between two nucleotides (A, C, G, and T). (b) A second-order polynomial fit to A–A for human chromosome 22, $\log P_0(S) = -0.17485 - 0.11356\,S + 5.64502 \times 10^{-4}\,S^2$.
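The distance distribution and its polynomial fit can be illustrated with the sketch below. It rests on assumptions that the text does not settle: for each occurrence of the first nucleotide, only the nearest following occurrence of the second nucleotide is counted (this reproduces the worked example for ...ATCAGTG...), distances above `s_max` are discarded, the logarithm is taken to base 10, and the quadratic is fitted by ordinary least squares through the normal equations. The names `distance_counts`, `distance_distribution`, `fit_quadratic`, and `random_sequence` are hypothetical, and no attempt is made to reproduce the coefficients reported above.

```python
import math
import random
from collections import Counter


def distance_counts(sequence, x, y, s_max=100):
    """N(S): for every x, the distance S to the nearest y to its right (assumed reading)."""
    seq = sequence.upper()
    counts = Counter()
    next_y = None  # index of the nearest y to the right of the current position
    for i in range(len(seq) - 1, -1, -1):
        if seq[i] == x and next_y is not None and next_y - i <= s_max:
            counts[next_y - i] += 1
        if seq[i] == y:
            next_y = i
    return counts


def distance_distribution(counts):
    """P0(S) = N(S) / N(1); empty if no pair at distance 1 exists."""
    n1 = counts.get(1, 0)
    if n1 == 0:
        return {}
    return {s: counts[s] / n1 for s in sorted(counts)}


def _det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_quadratic(xs, ys):
    """Least-squares (a, b, c) for y = a + b*x + c*x**2, solved with Cramer's rule."""
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    m = [[s[0], s[1], s[2]], [s[1], s[2], s[3]], [s[2], s[3], s[4]]]
    d = _det3(m)
    coeffs = []
    for col in range(3):
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][col] = t[r]  # replace one column by the right-hand side
        coeffs.append(_det3(mc) / d)
    return tuple(coeffs)


def random_sequence(length, seed=0):
    """Uniform random sequence over A, C, G, T (equal base probabilities assumed)."""
    rng = random.Random(seed)
    return "".join(rng.choices("ACGT", k=length))


if __name__ == "__main__":
    p0 = distance_distribution(distance_counts(random_sequence(200_000), "C", "G", s_max=15))
    s_values = sorted(p0)
    print(fit_quadratic(s_values, [math.log10(p0[s]) for s in s_values]))
```

Under the nearest-occurrence reading, a uniform random sequence gives a geometric decay of P0(S), so the fitted quadratic coefficient should be close to zero there, whereas a clearly nonzero quadratic term is what the text reports for the chromosomes. Other readings of the definition (for example, counting every pair at distance S) would lead to different curves.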
These investigations can provide some insights into human chromosomes 21 and 22, and may help us to explore some statistical properties of other DNA sequences.

Acknowledgments
This research was financially supported by the National Natural Science Foundation of China (nos. 20174036, 20274040) and the Natural Science Foundation of Zhejiang Province (no. 10102).

References
[1] Altschul SF, Gish W, Miller W, Myers E, Lipman D. A basic local alignment search tool. J Mol Biol 1990;215:403–10.
[2] http://www.ncbi.nlm.nih.gov/BLAST.
[3] Pearson W, Lipman D. Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 1988;85:2444–8.
[4] http://www.ebi.ac.uk/fasta33.
[5] Higgins D, Bleasby A, Fuchs R. Improved software for multiple sequence alignment. CABIOS 1992;8:189–91.
[6] http://www.edi.ac.uk/clustalw.
[7] Li W, Kaneko K. Long-range correlation and partial $1/f^{\alpha}$ spectrum in a noncoding DNA sequence. Europhys Lett 1992;17:655–60.
[8] de Sousa Vieira M. Statistics of DNA sequences: a low-frequency analysis. Phys Rev E 1999;60:5932–7.
[9] Li W. The study of correlation structures of DNA sequences: a critical review. Comput Chem 1997;21:257–71.
[10] Buldyrev SV, Goldberger AL, Havlin S, Mantegna RN, Matsa ME, Peng C-K, et al. Long-range correlation properties of coding and noncoding DNA sequences: GenBank analysis. Phys Rev E 1995;51:5084–91.
[11] Voss RF. Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. Phys Rev Lett 1992;68:3805–8.
[12] Holste D, Weiss O, Grobe I, Herzel H. Are non-coding sequences of R. Prowazekii remnants of 'neutralized' genes? J Mol Evol 2000;51:335–62.
[13] Bernaola-Galvan P, Carpena P, Roman-Roldan R, Oliver JL. Study of statistical correlations in DNA sequences. Gene 2002;300:105–15.
[14] Sun TT, Zhang LX, Chen J, Jiang ZT. Statistical properties and fractals of nucleotide clusters in DNA sequences. Chaos, Solitons and Fractals 2004;20:1075–84.
[15] Zhang LX, Jiang ZT. Long-range correlations in DNA sequences using 2D DNA walk based on pairs of sequential nucleotides. Chaos, Solitons and Fractals 2004;22:947–55.
[16] Peng C-K, Buldyrev SV, Goldberger AL, Havlin S, Simons M, Stanley HE. Finite-size effects on long-range correlations: implications for analyzing DNA sequences. Phys Rev E 1993;47:3730–3.
[17] Buldyrev SV, Goldberger AL, Havlin S, Peng C-K, Simons M, Stanley HE. Generalized Levy-walk model for DNA nucleotides. Phys Rev E 1993;47:4514–23.
[18] Buldyrev SV, Goldberger AL, Havlin S, Peng C-K, Stanley HE, Simons M. Fractal landscapes and molecular evolution: modeling the myosin heavy chain gene family. Biophys J 1993;65:2673–9.
[19] Peng C-K, Buldyrev SV, Goldberger AL, Havlin S, Sciortino F, Simons M, et al. Long-range correlation in nucleotide sequences. Nature (London) 1992;356:168–71.
[20] Herzel H, Grobe I. Correlations in DNA sequences: the role of protein coding segments. Phys Rev E 1997;55:800–10.
[21] Herzel H, Trifonov EN, Weiss O, Grobe I. Interpreting correlations in biosequences. Physica A 1998;248:449–59.
[22] Zipf GK. Human Behavior and the Principle of Least Effort. Cambridge MA: Addison-Wesley; 1949.
[23] Mantegna RN, Buldyrev SV, Goldberger AL, Havlin S, Peng CK, Simons M, Stanley HE. Systematic analysis of coding and noncoding DNA sequences using methods of statistical linguistics. Phys Rev E 1995;52:2939–50.
[24] Dokholyan NV, Buldyrev SV, Havlin S, Stanley HE. Distributions of dimeric tandem repeats in non-coding and coding DNA sequences. J Theor Biol 2000;202:273–82.
[25] Buldyrev SV, Dokholyan NV, Havlin S, Stanley HE, Stanley RHR. Expansion of tandem repeats and oligomer clustering in coding and noncoding DNA sequences. Physica A 1999;273:19–32.
[26] Dokholyan NV, Buldyrev SV, Havlin S, Stanley HE. Model of unequal chromosomal crossing over in DNA sequences. Physica A 1998;249:594–9.
[27] Hattori M, Fujiyama A, Taylor JD, et al. The DNA sequence of human chromosome 21. Nature 2000;405:311–9.
[28] Dunham I, Hunt AR, Collins JE, Bruskiewich R. The DNA sequence of human chromosome 22. Nature 1999;402:489–99.
[29] Bowcock AM, Ruiz-Linares A, Tomfohrde J, Minch E, Kidd JR, Cavalli-Sforza LL. High resolution of human evolutionary trees with polymorphic microsatellites. Nature 1994;368:455–7.