Chemical Physics Letters 388 (2004) 195–200 www.elsevier.com/locate/cplett
Analysis of similarity/dissimilarity of DNA sequences based on 3-D graphical representation
Bo Liao *, Tian-ming Wang
Department of Applied Mathematics, Dalian University of Technology, Dalian of Liaoning, Dalian 116024, China
Received 22 November 2003; in final form 14 February 2004
Published online: 20 March 2004
Abstract
Recently, we proposed a 3-D graphical representation of DNA sequences [Chem. Phys. Lett. 379 (2003) 412]. Based on this representation, we outline an approach to sequence comparison that constructs a 15-component vector whose components are the average bandwidths of the D/D matrices. The examination of similarities/dissimilarities among the coding sequences of the first exon of the β-globin gene of different species illustrates the utility of the approach. © 2004 Elsevier B.V. All rights reserved.
1. Introduction
In recent years, several authors have outlined different graphical representations of DNA sequences based on 2-D, 3-D or 4-D spaces [1,2,4–6,11–14]. The advantage of graphical representations of DNA sequences [1,2,4–8] is that they allow visual inspection of the data, which helps in recognizing major differences among similar DNA sequences. However, both 2-D and 3-D graphical representations are accompanied by some loss of information due to overlapping and crossing of the curve representing the DNA with itself. Randic [2] has presented a novel 2-D graphical representation, which avoids the limitation of Nandy's approach, and outlined an approach to analyze the similarity among the coding sequences of the first exon of the β-globin gene of 11 different species. Recently, we proposed a new 3-D graphical representation [1], which also avoids the limitations associated with crossing and overlapping, and from which the M/M matrices and L/L matrices can be obtained more easily. In this Letter, we compare the first exon of the β-globin gene sequences belonging to 11 different species based on a 3-D graphical representation.

* Corresponding author. Fax: +(86)411-4706100. E-mail addresses: [email protected] (B. Liao), wangtm@dlut.edu.cn (T.-m. Wang).
0009-2614/$ - see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.cplett.2004.02.089
Table 1 lists the coding sequences of the first exon of the β-globin gene for 11 different species, as reported by Randic [3]. Based on the order of A, G, T and C, we reduce a DNA primary sequence into three characteristic curves. Each characteristic curve may be regarded as a coarse-grained description of the DNA primary sequence. We construct a 15-component vector consisting of the average bandwidths of the D/D matrices. The similarities are computed by calculating either the Euclidean distance between the end points of the vectors or the correlation angle between two vectors.
2. 3-D graphical representation of DNA sequences
First, we assign the nucleic bases as follows:
$(-1, 0, 0) \to A$, $(1, 0, 0) \to G$, $(0, -1, 0) \to T$, $(0, 1, 0) \to C$,
while the corresponding curves extend along the z axis. In detail, let $G = g_1 g_2 \ldots$ be an arbitrary DNA primary sequence. Then we have a map $\phi$, which maps G into a plot set. We call the corresponding plot set the characteristic plot set. The curve connecting all points of the characteristic plot set in turn is called the characteristic curve. The bases of DNA can be classified into groups: purine (A,G)/pyrimidine (C,T), amino (A,C)/keto (G,T) and
B. Liao, T. Wang / Chemical Physics Letters 388 (2004) 195–200
Table 1
The coding sequences of the first exon of β-globin gene of 11 different species

Species     Coding sequence
Human       ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATTAAGTTGGTGGTGAGGCCCTGGGCAG
Goat        ATGCTGACTGCTGAGGAGAAGGCTGCCGTCACCGGCTTCTGGGGCAAGGTGAAAGTGGATGAAGTTGGTGCTGAGGCCCTGGGCAG
Opossum     ATGGTGCACTTGACTTCTGAGGAGAAGAACTGCATCACTACCATCTGGTCTAAGGTGCAGGTTGACCAGACTGGTGGTGAGGCCCTTGGCAG
Gallus      ATGGTGCACTGGACTGCTGAGGAGAAGCAGCTCATCACCGGCCTCTGGGGCAAGGTCAATGTGGCCGAATGTGGGGCCGAAGCCCTGGCCAG
Lemur       ATGACTTTGCTGAGTGCTGAGGAGAATGCTCATGTCACCTCTCTGTGGGGCAAGGTGGATGTAGAGAAAGTTGGTGGCGAGGCCTTGGGCAG
Mouse       ATGGTTGCACCTGACTGATGCTGAGAAGTCTGCTGTCTCTTGCCTGTGGGCAAAGGTGAACCCCGATGAAGTTGGTGGTGAGGCCCTGGGCAGG
Rabbit      ATGGTGCATCTGTCCAGTGAGGAGAAGTCTGCGGTCACTGCCCTGTGGGGCAAGGTGAATGTGGAAGAAGTTGGTGGTGAGGCCCTGGGC
Rat         ATGGTGCACCTAACTGATGCTGAGAAGGCTACTGTTAGTGGCCTGTGGGGAAAGGTGAACCCTGATAATGTTGGCGCTGAGGCCCTGGGCAG
Gorilla     ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGG
Bovine      ATGCTGACTGCTGAGGAGAAGGCTGCCGTCACCGCCTTTTGGGGCAAGGTGAAAGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAG
Chimpanzee  ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCAAGG
weak-H bond (A,T)/strong-H bond (G,C). We can obtain only three representations, corresponding to the three classifications.

Theorem. A DNA primary sequence is determined by any pair of its three characteristic curves.

Proof. Let $G = g_1 g_2 \ldots$ be an arbitrary DNA primary sequence. Then we have maps $\phi_j$, $j = 1, 2, 3$, each of which maps G into a set of triplets. Explicitly, $\phi_j(G) = \phi_j(g_1)\phi_j(g_2)\ldots$, where

$$
\phi_1(g_i) =
\begin{cases}
(-1, 0, i) & \text{if } g_i = A \\
(1, 0, i) & \text{if } g_i = G \\
(0, -1, i) & \text{if } g_i = T \\
(0, 1, i) & \text{if } g_i = C
\end{cases}
$$

$$
\phi_2(g_i) =
\begin{cases}
(-1, 0, i) & \text{if } g_i = A \\
(1, 0, i) & \text{if } g_i = C \\
(0, -1, i) & \text{if } g_i = T \\
(0, 1, i) & \text{if } g_i = G
\end{cases}
$$

$$
\phi_3(g_i) =
\begin{cases}
(-1, 0, i) & \text{if } g_i = A \\
(1, 0, i) & \text{if } g_i = T \\
(0, -1, i) & \text{if } g_i = G \\
(0, 1, i) & \text{if } g_i = C
\end{cases}
$$

Thus every $g_i$ corresponds to a triplet $(\phi_1(g_i), \phi_2(g_i), \phi_3(g_i))$, from which the theorem follows immediately.

The maps $\phi_1$, $\phi_2$, $\phi_3$ correspond to the patterns ATGC, ATCG and AGTC, respectively. In Figs. 1–3, we show the characteristic curves that represent the first 10 bases of the coding sequence of the first exon of the β-globin gene.
Fig. 1. Characteristic curve of the sequence ATGGTGCACC based on pattern ATGC. The dots denote the bases making up the sequence.
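As a minimal sketch of the maps of Section 2, the Python snippet below builds the characteristic curve of a sequence for each of the three patterns. The names `PATTERNS` and `characteristic_curve` are illustrative and do not come from the Letter, and the signs of the unit vectors are an assumption, chosen so that the resulting points reproduce the distances listed in Table 2.

```python
# Illustrative sketch only; names and sign conventions are assumptions.
# Each pattern assigns an (x, y) direction to every base; z is the position.
PATTERNS = {
    "ATGC": {"A": (-1, 0), "G": (1, 0), "T": (0, -1), "C": (0, 1)},
    "ATCG": {"A": (-1, 0), "C": (1, 0), "T": (0, -1), "G": (0, 1)},
    "AGTC": {"A": (-1, 0), "T": (1, 0), "G": (0, -1), "C": (0, 1)},
}


def characteristic_curve(sequence, pattern="ATGC"):
    """Return the (x, y, z) points of the curve, with z the 1-based position."""
    table = PATTERNS[pattern]
    return [table[base] + (i,) for i, base in enumerate(sequence, start=1)]


# The 10-base fragment shown in Figs. 1-3.
curve_atgc = characteristic_curve("ATGGTGCACC", "ATGC")
curve_atcg = characteristic_curve("ATGGTGCACC", "ATCG")
```

Because every base receives a distinct (x, y) pair, the list of points can be mapped back to the sequence by inverting the lookup table.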
3. Characterization of DNA sequences with sequence invariants
The sequence invariant that we introduce is explained using the 10 × 10 fragment of the distance matrix in Table 2. One can observe that in each row of the table the entries increase from left to right. The matrix can easily be rearranged by first placing the smallest entry (either 1 or $\sqrt{3}$) next to the main diagonal, then the next smallest entry (2 or $\sqrt{6}$ or $\sqrt{8}$) next to the first, and so on until all entries are arranged in increasing order as we move from the diagonal zero to the right. In the following, we will assume that all distance matrices considered have already been so ordered.

We can now consider the sums of the elements along the diagonals parallel to the main diagonal, which consists of zeroes. In the case of Table 3, we obtain the following sums for the first few neighbouring diagonals: $(2 + 7\sqrt{3})$, $(4 + 3\sqrt{6} + 3\sqrt{8})$, $(9 + 3\sqrt{11} + \sqrt{13})$, and so on. We will refer to these sums as band invariants of different width, specifically $b_1$, $b_2$, $b_3$ for $(2 + 7\sqrt{3})$, $(4 + 3\sqrt{6} + 3\sqrt{8})$, $(9 + 3\sqrt{11} + \sqrt{13})$. One can observe that these quantities are not matrix invariants, but they can always be extracted from any matrix, in whatever form it is presented, by considering adjacency between sequence elements. If the distance matrix is already in the canonical form based on assigning labels to nucleic acid bases sequentially, band invariants can readily be obtained by summing the elements along each of the lines parallel to the main diagonal. Dividing each band invariant by the number of entries in the band gives the average bandwidth. In Table 3, we show the band invariants based on the 10 × 10 fragments of the sequences of Human and Goat. As one immediately sees, there is considerable variation in the form and numerical values of the bandwidth invariants belonging to different sequences. In Tables 4–6, we list the numerical values of the average bandwidths.

Fig. 2. Characteristic curve of the sequence ATGGTGCACC based on pattern ATCG. The dots denote the bases making up the sequence.

Fig. 3. Characteristic curve of the sequence ATGGTGCACC based on pattern AGTC. The dots denote the bases making up the sequence.

4. Similarities/dissimilarities among the coding sequences of the first exon of β-globin gene of 11 species
We illustrate the use of the 3-D quantitative characterization of DNA sequences with an examination of similarities/dissimilarities among the 11 coding sequences of Table 1. We construct 15-component vectors consisting of the average bandwidths computed from the full DNA sequence. The underlying assumption is that if two vectors point in a similar direction in 15-dimensional space, then the two DNA sequences represented by the 15-component vectors are similar. The similarities among such vectors can be computed in three ways: (1) we calculate the Euclidean distance between the end points of the vectors; (2) we calculate the correlation angle of two vectors; (3) we calculate the cosine of the correlation angle of two vectors. Once the correlation angle of two vectors has been calculated, its cosine is easily obtained. The smaller the Euclidean distance between
Table 2
The upper triangle of the E matrix of the sequence ATGGTGCACC based on pattern ATGC

Base  A            T            G            G            T            G            C            A            C            C
A     0            $\sqrt{3}$   $\sqrt{8}$   $\sqrt{13}$  $\sqrt{18}$  $\sqrt{29}$  $\sqrt{38}$  7            $\sqrt{66}$  $\sqrt{83}$
T                  0            $\sqrt{3}$   $\sqrt{6}$   3            $\sqrt{18}$  $\sqrt{29}$  $\sqrt{38}$  $\sqrt{53}$  $\sqrt{68}$
G                               0            1            $\sqrt{6}$   3            $\sqrt{18}$  $\sqrt{29}$  $\sqrt{38}$  $\sqrt{51}$
G                                            0            $\sqrt{3}$   2            $\sqrt{11}$  $\sqrt{20}$  $\sqrt{27}$  $\sqrt{38}$
T                                                         0            $\sqrt{3}$   $\sqrt{8}$   $\sqrt{11}$  $\sqrt{20}$  $\sqrt{29}$
G                                                                      0            $\sqrt{3}$   $\sqrt{8}$   $\sqrt{11}$  $\sqrt{18}$
C                                                                                   0            $\sqrt{3}$   2            3
A                                                                                                0            $\sqrt{3}$   $\sqrt{6}$
C                                                                                                             0            1
C                                                                                                                          0
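The entries of Table 2 are Euclidean distances between points of the characteristic curve. The following sketch shows one way to compute such a matrix. It assumes the signed coordinates A = (−1, 0), G = (1, 0), T = (0, −1), C = (0, 1) for pattern ATGC, with z equal to the 1-based position; the function name `distance_matrix` is hypothetical.

```python
import math

# Assumed (x, y) coordinates for pattern ATGC; z is the position in the sequence.
COORDS_ATGC = {"A": (-1, 0), "G": (1, 0), "T": (0, -1), "C": (0, 1)}


def distance_matrix(sequence, coords=COORDS_ATGC):
    """Return the full symmetric matrix of pairwise Euclidean distances."""
    points = [coords[base] + (i,) for i, base in enumerate(sequence, start=1)]
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


E = distance_matrix("ATGGTGCACC")
# E[0][1] is the A-T entry of the first row, E[0][7] the A-A entry.
```

Only the upper triangle needs to be stored, since the matrix is symmetric with a zero diagonal.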
Table 3
Expressions for the average bandwidths 1–9 for the 10 × 10 fragments of the distance matrices of Human and Goat based on pattern ATGC

Band  Human                               Total   Goat                                Total
1     $(2 + 7\sqrt{3})/9$                 1.5694  $(3\sqrt{5} + 6\sqrt{3})/9$         1.9001
2     $(4 + 3\sqrt{6} + 3\sqrt{8})/8$     2.4792  $(2\sqrt{8} + 6\sqrt{6})/8$         2.5442
3     $(9 + 3\sqrt{11} + \sqrt{13})/7$    3.2222  $(6 + 3\sqrt{11} + 2\sqrt{13})/7$   3.3087
4     $(4\sqrt{18} + 2\sqrt{20})/6$       4.3191  $(12 + 2\sqrt{18} + \sqrt{20})/6$   4.1596
5     $(\sqrt{27} + 4\sqrt{29})/5$        5.3474  $(3\sqrt{27} + 2\sqrt{29})/5$       5.2718
6     $(4\sqrt{38})/4$                    6.1644  $(6 + 2\sqrt{38} + \sqrt{40})/4$    6.1634
7     $(7 + \sqrt{51} + \sqrt{53})/3$     7.1405  $(14 + \sqrt{51})/3$                7.0471
8     $(\sqrt{66} + \sqrt{68})/2$         8.1851  $(2\sqrt{66})/2$                    8.1240
9     $(\sqrt{83})/1$                     9.1104  $(\sqrt{85})/1$                     9.2195
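The totals in Table 3 can be checked numerically. The sketch below averages the distances along each diagonal parallel to the main diagonal, again assuming the signed ATGC coordinates and a 1-based z coordinate; `band_averages` is an illustrative name, not one used in the Letter.

```python
import math

# Assumed (x, y) coordinates for pattern ATGC.
COORDS_ATGC = {"A": (-1, 0), "G": (1, 0), "T": (0, -1), "C": (0, 1)}


def band_averages(sequence, coords=COORDS_ATGC, max_band=None):
    """Average distance between bases that are k positions apart, k = 1..max_band."""
    points = [coords[base] + (i,) for i, base in enumerate(sequence, start=1)]
    n = len(points)
    if max_band is None:
        max_band = n - 1
    averages = []
    for k in range(1, max_band + 1):
        # Band k has n - k entries: the pairs (i, i + k).
        total = sum(math.dist(points[i], points[i + k]) for i in range(n - k))
        averages.append(total / (n - k))
    return averages


human = band_averages("ATGGTGCACC")  # first 10 bases of the Human sequence
goat = band_averages("ATGCTGACTG")   # first 10 bases of the Goat sequence
```

Band 7 of the Human fragment contains three entries, which is why its sum is divided by 3.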
Table 4
Initial average bandwidths for the distance matrices of the first exon of β-globin gene of 11 species of Table 1 based on pattern ATGC

Species     Band 1  Band 2  Band 3  Band 4  Band 5
Human       1.6528  2.4380  3.3053  4.2825  5.1946
Goat        1.6882  2.4874  3.2610  4.2652  5.2111
Gallus      1.6639  2.4480  3.2858  4.2904  5.1920
Opossum     1.7318  2.4748  3.2651  4.2876  5.1809
Lemur       1.7293  2.4157  3.2985  4.2445  5.1854
Mouse       1.6762  2.4465  3.2894  4.2548  5.2032
Rabbit      1.6706  2.4302  3.2984  4.2611  5.2011
Rat         1.6528  2.4380  3.3053  4.2825  5.1946
Bovine      1.6445  2.4668  3.2677  4.2593  5.2063
Gorilla     1.6567  2.4764  3.2788  4.2632  5.2008
Chimpanzee  1.6469  2.4799  3.2824  4.2534  5.2039
Table 5
Initial average bandwidths for the distance matrices of the first exon of β-globin gene of 11 species of Table 1 based on pattern ATCG

Species     Band 1  Band 2  Band 3  Band 4  Band 5
Human       1.6891  2.4262  3.2758  4.2583  5.1813
Goat        1.6704  2.4468  3.2575  4.2512  5.1971
Gallus      1.6694  2.4438  3.2826  4.2669  5.1746
Opossum     1.7484  2.4706  3.2618  4.2746  5.1722
Lemur       1.7127  2.3946  3.2822  4.2393  5.1768
Mouse       1.7033  2.4465  3.2925  4.2370  5.1947
Rabbit      1.7102  2.3958  3.2718  4.2397  5.1922
Rat         1.6805  2.4422  3.2891  4.2591  5.2033
Bovine      1.6505  2.4352  3.2502  4.2509  5.1969
Gorilla     1.6950  2.4264  3.2628  4.2554  5.1880
Chimpanzee  1.6905  2.4505  3.2683  4.2488  5.1907
Table 6
Initial average bandwidths for the distance matrices of the first exon of β-globin gene of 11 species of Table 1 based on pattern AGTC

Species     Band 1  Band 2  Band 3  Band 4  Band 5
Human       1.5894  2.4725  3.3017  4.2479  5.1921
Goat        1.6111  2.5054  3.2575  4.2540  5.2158
Gallus      1.6251  2.4901  3.3085  4.2695  5.1876
Opossum     1.6266  2.4748  3.2943  4.2537  5.1831
Lemur       1.6352  2.3988  3.2822  4.2341  5.1920
Mouse       1.5949  2.4589  3.2894  4.2421  5.2138
Rabbit      1.5913  2.4302  3.2984  4.2450  5.1922
Rat         1.6251  2.4633  3.2826  4.2851  5.2098
Bovine      1.5793  2.4758  3.2433  4.2369  5.2016
Gorilla     1.5855  2.4639  3.2981  4.2451  5.1944
Chimpanzee  1.5839  2.4615  3.2994  4.2466  5.1964
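Tables 4–6 together supply the 15 components of the descriptor: five average bandwidths for each of the three patterns. A hedged sketch of how such a vector could be assembled is given below. The ordering of the components (ATGC, then ATCG, then AGTC), the signed coordinates and the names `PATTERNS`, `band_averages` and `descriptor` are assumptions made for illustration.

```python
import math

# Assumed (x, y) coordinates for the three patterns; z is the 1-based position.
PATTERNS = {
    "ATGC": {"A": (-1, 0), "G": (1, 0), "T": (0, -1), "C": (0, 1)},
    "ATCG": {"A": (-1, 0), "C": (1, 0), "T": (0, -1), "G": (0, 1)},
    "AGTC": {"A": (-1, 0), "T": (1, 0), "G": (0, -1), "C": (0, 1)},
}


def band_averages(sequence, coords, max_band=5):
    """Average bandwidths 1..max_band of the distance matrix of a sequence."""
    points = [coords[base] + (i,) for i, base in enumerate(sequence, start=1)]
    n = len(points)
    return [
        sum(math.dist(points[i], points[i + k]) for i in range(n - k)) / (n - k)
        for k in range(1, max_band + 1)
    ]


def descriptor(sequence, max_band=5):
    """Concatenate the band averages of the three patterns into one vector."""
    vector = []
    for name in ("ATGC", "ATCG", "AGTC"):
        vector.extend(band_averages(sequence, PATTERNS[name], max_band))
    return vector


# Descriptor of the 10-base fragment used in Table 3 (not of a full sequence).
fragment_vector = descriptor("ATGGTGCACC")
```

Passing a full coding sequence from Table 1 instead of the fragment yields the corresponding 15-component vector for that species.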
the end points of two vectors, the more similar are the DNA sequences. The smaller the correlation angle between two vectors, the more similar are the DNA sequences. Conversely, the larger the cosine of the correlation angle between two vectors, the more similar are the DNA sequences. In Table 7, we give the similarities and dissimilarities for the coding sequences of Table 1 based on the Euclidean distances between the end points of the 15-component vectors of the average bandwidths. We believe that it is not accidental that the smallest entries in Table 7 are associated with the pairs (Gorilla, Chimpanzee), (Human, Gorilla), (Mouse, Chimpanzee) and (Mouse, Gorilla). On the other hand, the larger entries in the similarity/dissimilarity matrix appear in the rows belonging to Opossum and Gallus. In Table 8, the similarities and dissimilarities for 11 coding sequences based on the correlation angle between
Table 7
Similarity/dissimilarity between the 11 species of Table 1 based on the Euclidean distances between the end points of the 15-component vectors of the average bandwidth using the full DNA sequence

Species     Human     Goat      Gallus    Opossum   Lemur     Mouse     Rabbit    Rat       Bovine    Gorilla   Chimpanzee
Human       0         0.108362  0.060439  0.130445  0.134172  0.064115  0.067734  0.066427  0.096158  0.054547  0.065209
Goat                  0         0.096006  0.122176  0.168879  0.094323  0.134167  0.105127  0.074859  0.085166  0.088577
Gallus                          0         0.118545  0.151975  0.088314  0.109775  0.062052  0.110794  0.079944  0.08550
Opossum                                   0         0.147262  0.117864  0.141904  0.136985  0.163541  0.119458  0.127640
Lemur                                               0         0.11539   0.088463  0.139818  0.170744  0.138215  0.150734
Mouse                                                         0         0.069820  0.075291  0.098238  0.061447  0.059135
Rabbit                                                                  0         0.095733  0.118901  0.073506  0.088815
Rat                                                                               0         0.108381  0.087454  0.088087
Bovine                                                                                      0         0.077639  0.078311
Gorilla                                                                                               0         0.030419
Chimpanzee                                                                                                      0
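An entry of Table 7 can be recomputed from the tabulated bandwidths. As an illustrative check, the sketch below takes the Gorilla and Chimpanzee values from Tables 4–6 (patterns ATGC, ATCG, AGTC, in that assumed order) and evaluates the Euclidean distance between the two 15-component vectors; the function name is hypothetical.

```python
import math

# Average bandwidths 1-5 from Tables 4, 5 and 6 (ATGC, ATCG, AGTC).
gorilla = [
    1.6567, 2.4764, 3.2788, 4.2632, 5.2008,
    1.6950, 2.4264, 3.2628, 4.2554, 5.1880,
    1.5855, 2.4639, 3.2981, 4.2451, 5.1944,
]
chimpanzee = [
    1.6469, 2.4799, 3.2824, 4.2534, 5.2039,
    1.6905, 2.4505, 3.2683, 4.2488, 5.1907,
    1.5839, 2.4615, 3.2994, 4.2466, 5.1964,
]


def euclidean_distance(u, v):
    """Distance between the end points of two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Compare with the Gorilla-Chimpanzee entry of Table 7.
d = euclidean_distance(gorilla, chimpanzee)
```

Since the inputs are rounded to four decimals, agreement with the table can only be expected to within rounding.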
Table 8
The similarity/dissimilarity matrix for the 11 species of Table 1 based on the angle between the end points of the 15-component vectors of the average bandwidth using the full DNA sequence

Species     Human     Goat      Gallus    Opossum   Lemur     Mouse     Rabbit    Rat       Bovine    Gorilla   Chimpanzee
Human       0         0.007723  0.004180  0.009273  0.009288  0.004592  0.004669  0.004277  0.006631  0.003893  0.004679
Goat                  0         0.006883  0.008757  0.011612  0.006743  0.009359  0.007445  0.004546  0.005960  0.006258
Gallus                          0         0.008502  0.010250  0.006283  0.007466  0.004348  0.007326  0.005511  0.005984
Opossum                                   0         0.009840  0.008403  0.009838  0.009786  0.011294  0.008400  0.009040
Lemur                                               0         0.007757  0.006244  0.008869  0.012272  0.009701  0.010569
Mouse                                                         0         0.004718  0.005102  0.006681  0.004345  0.004213
Rabbit                                                                  0         0.005928  0.008533  0.005200  0.006274
Rat                                                                               0         0.006644  0.005737  0.005885
Bovine                                                                                      0         0.005374  0.005351
Gorilla                                                                                               0         0.002172
Chimpanzee                                                                                                      0
two vectors are shown. Observing Table 8, we find that Opossum and Gallus are very dissimilar to the others among the 11 species, because their corresponding rows have larger entries. On the other hand, the more similar species pairs are Gorilla–Chimpanzee, Human–Gorilla, Mouse–Chimpanzee and Mouse–Gorilla. Similar results have been obtained by Randic [3,4,9,10]. However, some disappointing values still occur: pairs such as Human–Gallus and Gallus–Rat are also among the small entries in Tables 7 and 8. The reasons for this may be as follows: (1) there is some loss of information associated with the distance matrix; (2) the information extracted from each structure is not enough to compare 11 species.

The similarities based on the full DNA primary sequence are somewhat altered when the full DNA primary sequence is compared with the initial part of the structure, as illustrated in Fig. 4. Entries that remain close to the line $y = x$ indicate pairs of DNA sequences which have not varied much after the initial changes. Entries that are at a greater distance from the line $y = x$ indicate DNA sequences that show considerable variation not only in the initial fragments, but also throughout their length. In Fig. 4, we show a plot of bandwidths 1–5 for the initial fragment of length 10 of Human against the bandwidths 1–5 for the full length of Human based on pattern ATGC.

Fig. 4. Plot of initial average bandwidths 1–5 as computed from the complete DNA sequence of Human versus the corresponding bandwidths of the subsequence with length 10.

Randic used a 12-component vector whose components are the normalized leading eigenvalues [3], a 16-component vector whose components are the frequencies of occurrence of all possible ordered pairs of adjacent bases [9], an n-component (with $n = 5, 10, 15$) vector whose components are the average bandwidths of the full (or fragment) DNA sequence [4], and a 64-component vector consisting of the frequencies of occurrence of all ordered triplets of bases (segments of length 3) [10]. We use a 15-component vector whose components are the average bandwidths computed from the full DNA sequence. There is an overall qualitative agreement among the similarities based on the different descriptors, despite some variations among them.
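The second and third similarity measures of Section 4, the correlation angle and its cosine, can be sketched in the same way. The example below reuses the Gorilla and Chimpanzee bandwidths of Tables 4–6 and assumes that the angles of Table 8 are expressed in radians; the function names are illustrative only.

```python
import math

# Average bandwidths 1-5 from Tables 4, 5 and 6 (ATGC, ATCG, AGTC).
gorilla = [
    1.6567, 2.4764, 3.2788, 4.2632, 5.2008,
    1.6950, 2.4264, 3.2628, 4.2554, 5.1880,
    1.5855, 2.4639, 3.2981, 4.2451, 5.1944,
]
chimpanzee = [
    1.6469, 2.4799, 3.2824, 4.2534, 5.2039,
    1.6905, 2.4505, 3.2683, 4.2488, 5.1907,
    1.5839, 2.4615, 3.2994, 4.2466, 5.1964,
]


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against rounding just outside the domain of acos.
    return max(-1.0, min(1.0, dot / (norm_u * norm_v)))


def correlation_angle(u, v):
    """Angle between two vectors, in radians."""
    return math.acos(cosine(u, v))


# Compare with the Gorilla-Chimpanzee entry of Table 8.
angle = correlation_angle(gorilla, chimpanzee)
cos_angle = cosine(gorilla, chimpanzee)
```

Because all components are positive and of similar size, the cosines are very close to 1 for every pair, so the angle is the more readable of the two quantities.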
5. Conclusion
We present a similarity measure between DNA sequences. The advantage of our approach is that it allows visual inspection of the data, which helps in recognizing major similarities among different DNA sequences. It is well known that alignment of DNA sequences, that is, direct comparison of the sequences, is computationally intensive. The sequences considered in alignment are only strings of characters. Here, we use an approach that considers not only the sequence structure but also the chemical structure of DNA sequences. A sequence invariant that is easily computed and compared is used to compare DNA sequences, rather than the strings themselves.
References
[1] Chunxin Yuan, Bo Liao, Tianming Wang, Chem. Phys. Lett. 379 (2003) 412.
[2] M. Randic, M. Vracko, N. Lers, D. Plavsic, Chem. Phys. Lett. 368 (2003) 1.
[3] M. Randic, M. Vracko, N. Lers, D. Plavsic, Chem. Phys. Lett. 371 (2003) 202.
[4] M. Randic, A.T. Balaban, J. Chem. Inf. Comput. Sci. 40 (2000) 50.
[5] M. Randic, M. Vracko, A. Nandy, S.C. Basak, J. Chem. Inf. Comput. Sci. 40 (2000) 1235.
[6] A. Nandy, Curr. Sci. 66 (1994) 309.
[7] A. Nandy, P. Nandy, Curr. Sci. 68 (1995) 75.
[8] A. Nandy, Curr. Sci. 70 (1996) 661.
[9] M. Randic, J. Chem. Inf. Comput. Sci. 40 (2000) 50.
[10] M. Randic, X.F. Guo, S.C. Basak, J. Chem. Inf. Comput. Sci. 41 (2001) 619.
[11] R. Zhang, C.T. Zhang, J. Biomol. Str. Dyn. 11 (4) (1994) 767.
[12] M.L. Lan, M.S.T. Carpendale, Supporting Detail-in-Context for the DNA Representation, H-Curves, 1998.
[13] M. Randic, M. Vracko, M. Novic, in: M.V. Diudea (Ed.), QSPR/QSAR Studies by Molecular Descriptors, Nova Science, Huntington, New York, 2001, p. 145.
[14] E. Hamori, J. Ruskin, J. Biol. Chem. 258 (1983) 1318.