A 4D representation of DNA sequences and its application


Chemical Physics Letters 402 (2005) 380–383 www.elsevier.com/locate/cplett

Bo Liao a,*, Mingshu Tan b, Kequan Ding a

a Department of Applied Mathematics, Graduate School of the Chinese Academy of Sciences, Science 100 Laboratory, Number 19, Yuquan Road, Beijing 100049, China
b Chongqing Three Gorges University, Chongqing, Wanzhou 404000, China

Received 12 November 2004; in final form 16 December 2004
Available online 4 January 2005

Abstract

A 4D representation of DNA sequences has been derived for the mathematical denotation of DNA sequences. The 4D representation avoids the loss of information that accompanies alternative 2D and 3D representations. The geometrical centers of the 4D graphs of DNA sequences indicate the distribution of base frequencies. An interesting phenomenon is observed for Goat and Gallus β-globin genomes with high G + C content. The examination of similarities/dissimilarities among the coding sequences of the first exon of the β-globin gene of different species illustrates the utility of the approach.

© 2004 Published by Elsevier B.V.

* Corresponding author. Fax: +86 10 88256147. E-mail address: [email protected] (B. Liao).

0009-2614/$ - see front matter © 2004 Published by Elsevier B.V. doi:10.1016/j.cplett.2004.12.062

1. Introduction

Mathematical analysis of the large volume of genomic DNA sequence data is one of the challenges for bio-scientists. Graphical representation of DNA sequences provides a simple way of viewing, sorting and comparing various gene structures [1–17]. Nandy [7] presented a graphical representation by assigning A (adenine), G (guanine), T (thymine) and C (cytosine) to the four directions (−x), (+x), (−y) and (+y), respectively. Such a representation of DNA is accompanied by: (1) some loss of visual information associated with crossing and overlapping of the resulting curve by itself; (2) an arbitrary decision with respect to the choice of the direction for the four bases. Randic [9] presented a 3D graphical representation, but the limitations associated with crossing and overlapping of the spatial curve representing a DNA sequence remain. Hamori presented the H-curve [11], which is a 3D graphical representation of DNA sequences. The four bases are represented by four directions (NW, NE, SE and SW). The basic rule for constructing the H-curve is to move, for each base, one unit in the corresponding direction and one unit in the z-direction. The H-curve can uniquely represent a DNA sequence, but it requires a 2D projection or a 3D stereo projection of DNA sequences.

Recently, Randic and Liao also presented some 3D and 2D graphical representations [1–4,9,10,16,17], which avoid the limitations associated with crossing and overlapping. However, these representations are accompanied by the computation of the L/L matrix and its leading eigenvalue, which requires considerable time and space. Moreover, most of these representations are not unique.

In this Letter, we introduce a new 4D representation of DNA primary sequences, in which there is no loss of information in the transfer of data from a DNA sequence to its mathematical representation. Our approach differs from Randic's 4D representation [16]. The 4D representation we introduce is simple and direct, requires no 2D projection or 3D stereo projection, and uniquely represents a DNA sequence. With our approach the computation is simple.
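The overlap problem of the 2D walk mentioned above can be made concrete with a minimal sketch. It assumes the direction assignment A → −x, G → +x, T → −y, C → +y; the function names `walk_2d` and `drawn_segments` are illustrative and do not come from the cited works.

```python
# Illustrative 2D walk: each base moves one unit along an axis.
# Assumed assignment: A -> -x, G -> +x, T -> -y, C -> +y.
STEP = {"A": (-1, 0), "G": (1, 0), "T": (0, -1), "C": (0, 1)}

def walk_2d(seq):
    """Return the lattice points visited by the walk, starting at the origin."""
    x, y = 0, 0
    path = [(x, y)]
    for base in seq:
        dx, dy = STEP[base]
        x, y = x + dx, y + dy
        path.append((x, y))
    return path

def drawn_segments(seq):
    """Return the set of undirected unit segments that the walk draws."""
    path = walk_2d(seq)
    return {frozenset(pair) for pair in zip(path, path[1:])}

# "AG" and "AGAG" retrace the same segment, so their plots are identical.
print(drawn_segments("AG") == drawn_segments("AGAG"))
```

Two different sequences produce the same picture here, which is the kind of degeneracy that a representation without crossing and overlapping is meant to avoid.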


2. 4D representation of DNA sequences and their numerical characterizations

It is well known that the four bases A, C, G, T can be divided into two groups in three ways: purine (R = {A, G})/pyrimidine (Y = {C, T}), amino (M = {A, C})/keto (K = {G, T}) and weak H-bonds (W = {A, T})/strong H-bonds (S = {G, C}). Based on these classifications of the four nucleic acid bases, we reduce a DNA primary sequence $G = g_1 g_2 \ldots g_N$ into a series of nodes $P_1, P_2, \ldots, P_N$, whose coordinates $x_i, y_i, z_i$ and $s_i$ ($i = 1, 2, \ldots, N$, where $N$ is the length of the DNA sequence being studied and $P_i$ corresponds to base $g_i$) satisfy

$$x_i = \begin{cases} 1 & \text{if } g_i \in \{A, G\}, \\ 0 & \text{if } g_i \in \{C, T\}, \end{cases} \qquad y_i = \begin{cases} 1 & \text{if } g_i \in \{A, C\}, \\ 0 & \text{if } g_i \in \{G, T\}, \end{cases} \qquad z_i = \begin{cases} 1 & \text{if } g_i \in \{A, T\}, \\ 0 & \text{if } g_i \in \{C, G\}, \end{cases} \qquad s_i = 1 - \frac{1}{i}.$$

That is to say, we have a map $\phi$ which maps $G$ into a plot set. Explicitly, $\phi(G) = \phi(g_1)\phi(g_2)\ldots$, where

$$\phi(g_i) = \begin{cases} (1, 1, 1, 1 - \frac{1}{i}) & \text{if } g_i = A, \\ (1, 0, 0, 1 - \frac{1}{i}) & \text{if } g_i = G, \\ (0, 1, 0, 1 - \frac{1}{i}) & \text{if } g_i = C, \\ (0, 0, 1, 1 - \frac{1}{i}) & \text{if } g_i = T. \end{cases}$$

Obviously, an arbitrary DNA primary sequence is uniquely determined by the 4D representation. For any sequence, we have a set of points $(x_i, y_i, z_i, s_i)$, $i = 1, 2, 3, \ldots, N$, where $N$ is the length of the sequence. Let

$$\bar{x}_i = \frac{1}{i}\sum_{k=1}^{i} x_k, \qquad \bar{y}_i = \frac{1}{i}\sum_{k=1}^{i} y_k, \qquad \bar{z}_i = \frac{1}{i}\sum_{k=1}^{i} z_k, \qquad \bar{s}_i = \frac{1}{i}\sum_{k=1}^{i} s_k.$$

After simple computation, we obtain

$$\bar{x}_i = \frac{A_i + G_i}{i}, \qquad \bar{y}_i = \frac{A_i + C_i}{i}, \qquad \bar{z}_i = \frac{A_i + T_i}{i},$$

where $A_i$, $C_i$, $G_i$ and $T_i$ denote the numbers of the respective bases among the first $i$ bases. The direct biological meaning of $\bar{x}_i$, $\bar{y}_i$ and $\bar{z}_i$ is that they indicate the distribution of base frequencies. The coordinates of the geometrical center of the points, denoted by $x^0$, $y^0$, $z^0$ and $s^0$, may be calculated as follows (with $n = N$ the sequence length):

$$x^0 = \bar{x}_n = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad y^0 = \bar{y}_n = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad z^0 = \bar{z}_n = \frac{1}{n}\sum_{i=1}^{n} z_i, \qquad s^0 = \bar{s}_n = \frac{1}{n}\sum_{i=1}^{n} s_i. \tag{1}$$

Obviously, the geometrical center is

$$\left(\frac{G_n + A_n}{n},\ \frac{C_n + A_n}{n},\ \frac{T_n + A_n}{n},\ 1 - \frac{1 + \frac{1}{2} + \cdots + \frac{1}{n}}{n}\right).$$

The direct biological meaning of $(x^0, y^0, z^0, s^0)$ is that it indicates the distribution of base frequencies. For example, let the frequencies of bases A, C, G and T be $a$, $c$, $g$ and $t$, respectively. Then $x^0 \le y^0$ indicates $g \le c$, otherwise $g \ge c$; $y^0 \le z^0$ indicates $c \le t$, otherwise $c \ge t$; $x^0 \le z^0$ indicates $g \le t$, otherwise $g \ge t$. In Table 2 we list the geometrical centers of the first exon of the β-globin gene belonging to 11 species. Table 2 reveals features of the 11 DNA primary sequences that are not easily visible from the sequences themselves. For example, $y^0 = z^0$ and $x^0 = 1 - y^0 = 1 - z^0$ hold for the geometrical center of goat, which indicates $c = t$, $c = a$ and $t = a$. In Table 3, we present the relations between $a$, $c$, $g$ and $t$.
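As a minimal sketch of the map and of the geometrical center defined in Eq. (1), the following Python code converts a DNA string into its 4D points, inverts the map, and averages the coordinates. The function names (`to_4d`, `from_4d`, `geometric_center`) are illustrative choices, not taken from the Letter, and the inversion assumes the points are exact outputs of `to_4d`.

```python
# Indicator coordinates (x, y, z) of each base:
# x marks purines {A, G}, y marks amino bases {A, C}, z marks weak H-bond bases {A, T}.
BASE_TO_XYZ = {"A": (1, 1, 1), "G": (1, 0, 0), "C": (0, 1, 0), "T": (0, 0, 1)}
XYZ_TO_BASE = {xyz: base for base, xyz in BASE_TO_XYZ.items()}

def to_4d(seq):
    """Map a DNA string to the points (x_i, y_i, z_i, s_i), i = 1..N, with s_i = 1 - 1/i."""
    return [BASE_TO_XYZ[b] + (1 - 1 / i,) for i, b in enumerate(seq.upper(), start=1)]

def from_4d(points):
    """Recover the sequence: (x, y, z) identifies the base, increasing s gives the order."""
    ordered = sorted(points, key=lambda p: p[3])
    return "".join(XYZ_TO_BASE[p[:3]] for p in ordered)

def geometric_center(seq):
    """Average each coordinate over all N points, as in Eq. (1)."""
    pts = to_4d(seq)
    n = len(pts)
    return tuple(sum(p[k] for p in pts) / n for k in range(4))

points = to_4d("GATTACA")
print(from_4d(list(reversed(points))))   # the order is recovered from s_i
print(geometric_center("ACGT"))
```

Because the fourth coordinate depends only on the position, the fourth component of the center depends only on the sequence length; for a length of 92 it evaluates to about 0.9445.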

Table 1
The coding sequences of the first exon of β-globin gene of 11 different species

Species      Coding sequence
Human        ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATTAAGTTGGTGGTGAGGCCCTGGGCAG
Goat         ATGCTGACTGCTGAGGAGAAGGCTGCCGTCACCGGCTTCTGGGGCAAGGTGAAAGTGGATGAAGTTGGTGCTGAGGCCCTGGGCAG
Opossum      ATGGTGCACTTGACTTCTGAGGAGAAGAACTGCATCACTACCATCTGGTCTAAGGTGCAGGTTGACCAGACTGGTGGTGAGGCCCTTGGCAG
Gallus       ATGGTGCACTGGACTGCTGAGGAGAAGCAGCTCATCACCGGCCTCTGGGGCAAGGTCAATGTGGCCGAATGTGGGGCCGAAGCCCTGGCCAG
Lemur        ATGACTTTGCTGAGTGCTGAGGAGAATGCTCATGTCACCTCTCTGTGGGGCAAGGTGGATGTAGAGAAAGTTGGTGGCGAGGCCTTGGGCAG
Mouse        ATGGTTGCACCTGACTGATGCTGAGAAGTCTGCTGTCTCTTGCCTGTGGGCAAAGGTGAACCCCGATGAAGTTGGTGGTGAGGCCCTGGGCAGG
Rabbit       ATGGTGCATCTGTCCAGTGAGGAGAAGTCTGCGGTCACTGCCCTGTGGGGCAAGGTGAATGTGGAAGAAGTTGGTGGTGAGGCCCTGGGC
Rat          ATGGTGCACCTAACTGATGCTGAGAAGGCTACTGTTAGTGGCCTGTGGGGAAAGGTGAACCCTGATAATGTTGGCGCTGAGGCCCTGGGCAG
Gorilla      ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGG
Bovine       ATGCTGACTGCTGAGGAGAAGGCTGCCGTCACCGCCTTTTGGGGCAAGGTGAAAGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAG
Chimpanzee   ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCAAGG


Observing Table 3, we can conclude that G is the most frequent of the four bases in every sequence. An interesting phenomenon is also observed for Goat and Gallus β-globin genomes with high G + C content.
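The base-frequency relations can be recovered directly from a geometrical center. Since $x^0 = a + g$, $y^0 = a + c$, $z^0 = a + t$ and $a + c + g + t = 1$, it follows that $a = (x^0 + y^0 + z^0 - 1)/2$ and that the G + C content equals $1 - z^0$. The sketch below applies this to the Goat and Human centers quoted in Table 2; the helper name `base_frequencies` is a hypothetical choice for illustration.

```python
def base_frequencies(x0, y0, z0):
    """Solve x0 = a + g, y0 = a + c, z0 = a + t under a + c + g + t = 1."""
    a = (x0 + y0 + z0 - 1) / 2
    return {"a": a, "c": y0 - a, "g": x0 - a, "t": z0 - a}

# Centers (x0, y0, z0) as listed in Table 2; s0 is not needed here.
goat = base_frequencies(0.6047, 0.3953, 0.3953)
human = base_frequencies(0.5652, 0.3913, 0.4130)

# Goat: a, c and t coincide, and G + C content equals 1 - z0.
print(goat, goat["g"] + goat["c"])

# Human: sort the bases from most to least frequent.
print(sorted(human, key=human.get, reverse=True))
```

The values are limited by the four-decimal rounding of Table 2, so equalities such as $c = a = t$ hold only up to that precision.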

Table 2
The geometrical center of the first exon of β-globin gene belonging to 11 species

Species      (x^0, y^0, z^0, s^0)
Human        (0.5652, 0.3913, 0.4130, 0.9445)
Goat         (0.6047, 0.3953, 0.3953, 0.9414)
Gallus       (0.5761, 0.4674, 0.3696, 0.9445)
Opossum      (0.5435, 0.4457, 0.4674, 0.9445)
Lemur        (0.5870, 0.3696, 0.4565, 0.9445)
Mouse        (0.5426, 0.3936, 0.4255, 0.9455)
Rabbit       (0.6000, 0.3667, 0.4111, 0.9435)
Rat          (0.5761, 0.4130, 0.4457, 0.9445)
Bovine       (0.6047, 0.3837, 0.4070, 0.9414)
Gorilla      (0.5806, 0.3871, 0.3978, 0.9450)
Chimpanzee   (0.5810, 0.3810, 0.4190, 0.9501)

Table 3
The relations between a, c, g, and t

Species      Relation
Human        g > t > c > a
Goat         c = a = t < g
Gallus       g > c > a > t
Opossum      g > t > a > c
Lemur        g > t > a > c
Mouse        g > t > c > a
Rabbit       g > t > a > c
Rat          g > t > a > c
Bovine       g > t > a > c
Gorilla      g > t > c > a
Chimpanzee   g > t > c > a

3. Application

In order to facilitate the quantitative comparison of different species in terms of their collective parameters, we introduce a distance scale as defined below. Suppose that there are two species $i$ and $j$ whose parameters are $x^0(i), y^0(i), z^0(i), s^0(i)$ and $x^0(j), y^0(j), z^0(j), s^0(j)$, respectively. We illustrate the use of the 4D quantitative characterization of DNA sequences with an examination of similarities/dissimilarities among the 11 coding sequences of Table 1. For each sequence we construct a four-component vector consisting of the geometrical center. The underlying assumption is that if two vectors point in a similar direction in the 4D space, then the two DNA sequences represented by the four-component vectors are similar. The similarities among such vectors can be computed in two ways: (1) we calculate the Euclidean distance between the end points of the vectors; (2) we calculate the correlation angle of the two vectors. The distance $d_{ij}$ between the two centers is

$$d_{ij} = \sqrt{[x^0(i) - x^0(j)]^2 + [y^0(i) - y^0(j)]^2 + [z^0(i) - z^0(j)]^2 + [s^0(i) - s^0(j)]^2}. \tag{2}$$

The angle $\theta_{ij}$ between the two lines is

$$\cos\theta_{ij} = \frac{x^0(i)x^0(j) + y^0(i)y^0(j) + z^0(i)z^0(j) + s^0(i)s^0(j)}{\sqrt{[x^0(i)]^2 + [y^0(i)]^2 + [z^0(i)]^2 + [s^0(i)]^2}\ \sqrt{[x^0(j)]^2 + [y^0(j)]^2 + [z^0(j)]^2 + [s^0(j)]^2}}. \tag{3}$$

The smaller the Euclidean distance, the more similar the DNA sequences; likewise, the smaller the correlation angle, the more similar the DNA sequences. That is to say, the distances or angles between evolutionarily closely related species are smaller, while those between evolutionarily disparate species are larger. Observing Tables 4 and 5, we find that Gallus is very dissimilar to the others among the 11 species because its corresponding row has larger entries. On the other hand, the more similar species pairs are Goat–Bovine, Rabbit–Bovine, Human–Gorilla, Human–Chimpanzee, Gorilla–Chimpanzee, Goat–Gorilla, Rabbit–Chimpanzee, Human–Mouse, Bovine–Gorilla, and Bovine–Chimpanzee. Therefore, the classification of species, provided that the numbers of their coding sequences are sufficiently large, can generally be performed in terms of the two matrices listed in Tables 4 and 5. In other words, with the continuous increase in the number of coding sequences for various species, it is possible to perform cluster analysis using the distance and angle matrices.
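The two similarity measures of Eqs. (2) and (3) can be sketched in a few lines of Python. The centers below are copied from Table 2 for four of the species; the function names and the `closest_pair` helper are illustrative assumptions, and the angle is returned in radians.

```python
import math

# Geometrical centers (x0, y0, z0, s0) copied from Table 2 for four species.
CENTERS = {
    "Human": (0.5652, 0.3913, 0.4130, 0.9445),
    "Goat": (0.6047, 0.3953, 0.3953, 0.9414),
    "Gallus": (0.5761, 0.4674, 0.3696, 0.9445),
    "Bovine": (0.6047, 0.3837, 0.4070, 0.9414),
}

def distance(u, v):
    """Euclidean distance between the end points of two center vectors, Eq. (2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def angle(u, v):
    """Angle in radians between two center vectors, Eq. (3)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def closest_pair(centers, measure):
    """Return (value, name1, name2) for the most similar pair under a measure."""
    names = sorted(centers)
    return min(
        (measure(centers[p], centers[q]), p, q)
        for k, p in enumerate(names)
        for q in names[k + 1:]
    )

print(distance(CENTERS["Human"], CENTERS["Goat"]))
print(angle(CENTERS["Human"], CENTERS["Goat"]))
print(closest_pair(CENTERS, distance))
```

With the rounded centers, the Human–Goat distance and angle agree with the corresponding entries of Tables 4 and 5 to about four decimal places, and Goat–Bovine comes out as the closest pair among these four species under both measures.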


Table 4
The similarity/dissimilarity matrix for the coding sequences of Table 1 based on the Euclidean distances between the end points of the four-component vectors of the geometrical center

Species     Human     Goat      Gallus    Opossum   Lemur     Mouse     Rabbit    Rat       Bovine    Gorilla   Chimpanzee
Human       0         0.043579  0.088281  0.079935  0.053276  0.025948  0.042671  0.040731  0.040787  0.022047  0.020569
Goat                  0         0.081771  0.107208  0.068766  0.069196  0.033077  0.060671  0.016476  0.025831  0.037464
Gallus                          0         0.105349  0.131283  0.098461  0.111512  0.093544  0.096083  0.085228  0.099803
Opossum                                   0         0.088330  0.066872  0.112267  0.051019  0.106053  0.098259  0.089254
Lemur                                               0         0.059240  0.047324  0.046033  0.054516  0.061589  0.040045
Mouse                                                         0         0.065036  0.043677  0.065677  0.047474  0.041191
Rabbit                                                                  0         0.062554  0.018299  0.031171  0.025913
Rat                                                                               0         0.056425  0.054642  0.042335
Bovine                                                                                      0         0.026267  0.028083
Gorilla                                                                                               0         0.022646
Chimpanzee                                                                                                      0

Table 5
The similarity/dissimilarity matrix for the coding sequences of Table 1 based on the angle between the four-component vectors of the geometrical center

Species     Human     Goat      Gallus    Opossum   Lemur     Mouse     Rabbit    Rat       Bovine    Gorilla   Chimpanzee
Human       0         0.033716  0.069350  0.059865  0.039997  0.020677  0.033814  0.026783  0.031345  0.017759  0.014286
Goat                  0         0.065070  0.084249  0.054562  0.054158  0.026266  0.047424  0.013171  0.018942  0.029950
Gallus                          0         0.083139  0.104476  0.077078  0.088762  0.074160  0.076521  0.067014  0.079465
Opossum                                   0         0.069626  0.046925  0.087907  0.040193  0.083332  0.075512  0.069611
Lemur                                               0         0.043819  0.036731  0.036357  0.043135  0.047312  0.031263
Mouse                                                         0         0.051560  0.027138  0.051209  0.038099  0.030970
Rabbit                                                                  0         0.048268  0.014158  0.024568  0.020632
Rat                                                                               0         0.043997  0.039990  0.032129
Bovine                                                                                      0         0.019274  0.022432
Gorilla                                                                                               0         0.016644
Chimpanzee                                                                                                      0

4. Conclusion

We present a 4D representation of DNA sequences and a similarity measure between DNA sequences. The advantage of our approach is that it allows visual inspection of data, helping to recognize major similarities among different DNA sequences. It is well known that alignment of DNA sequences, which compares the sequences directly, is computationally intensive, and that alignment treats DNA sequences only as strings. Here, we use an approach that considers not only the sequence structure but also the chemical structure of DNA sequences, and in which the computation remains simple for long sequences.

Acknowledgement

The authors thank the anonymous referees for many valuable suggestions that have improved this manuscript. This work is supported by grant A0324670 from the National Natural Science Foundation of China.

References

[1] Chunxin Yuan, Bo Liao, Tianming Wang, Chem. Phys. Lett. 379 (2003) 412.
[2] Bo Liao, Tianming Wang, J. Comput. Chem. 25 (2004) 1364.
[3] Bo Liao, Tianming Wang, J. Mol. Struct.: Theochem. 681 (2004) 209.
[4] Bo Liao, Tianming Wang, Chem. Phys. Lett. 388 (2004) 195.
[5] Bo Liao, Tianming Wang, J. Chem. Inf. Comput. Sci. 44 (2004) 1666.
[6] Bo Liao, Chem. Phys. Lett. 401 (2005) 196.
[7] A. Nandy, Curr. Sci. 66 (2004) 309.
[8] Stephn S.-T. Yan, JiaSong Wang, Air Niknejad, Chaoxiao Lu, Ning Jin, Yee-kin Ho, Nucleic Acid Res. 31 (2003) 3078.
[9] M. Randic, M. Vracko, A. Nandy, S.C. Basak, J. Chem. Inf. Comput. Sci. 40 (2000) 1235.
[10] Milan Randic, Majan Vracko, Nella Lers, Dejan Plavsic, Chem. Phys. Lett. 368 (2003) 1.
[11] E. Hamori, J. Ruskin, J. Biol. Chem. 258 (1983) 1318.
[12] E. Hamori, Nature 314 (1985) 585.
[13] M.A. Gates, Nature 316 (1985) 219.
[14] A. Nandy, Comput. Appl. Biosci. 12 (1996) 55.
[15] R. Zhang, C.T. Zhang, J. Biomol. Struct. Dyn. 11 (1994) 767.
[16] M. Randic, Alexandru T. Balaban, J. Chem. Inf. Comput. Sci. 40 (2000) 50.
[17] M. Randic, Marjan Vracko, Jure Zupan, Marjana Novic, Chem. Phys. Lett. 373 (2003) 558.