Mathematical Biosciences 241 (2013) 217–224
C-curve: A novel 3D graphical representation of DNA sequence based on codons
Nafiseh Jafarzadeh, Ali Iranmanesh ⇑
Department of Mathematics, Faculty of Mathematical Sciences, Tarbiat Modares University, P.O. Box 14115-137, Tehran, Iran
Article info
Article history: Received 12 June 2012; received in revised form 18 November 2012; accepted 26 November 2012; available online 13 December 2012.
Keywords: DNA sequence; 3D codon curve; p-Matrix; s-Matrix; Similarities/dissimilarities
Abstract In this paper, a novel 3D graphical representation of DNA sequences based on codons is proposed. Because the curve contains no loops and does not overlap itself, no information is lost, which makes the representation useful for comparing different DNA sequences and especially convenient for comparing DNA mutations. We then give a numerical characterization of DNA sequences based on the new 3D curve. This characterization facilitates the quantitative analysis of similarities/dissimilarities of DNA sequences based on codons. © 2012 Elsevier Inc. All rights reserved.
1. Introduction DNA is a nucleic acid that contains the genetic instructions used in the development and functioning of all known living organisms. DNA is a polymer whose monomer units are nucleotides. Four types of nucleotides are found in DNA, differing only in the nitrogenous base, and each is abbreviated by the one-letter code of its base: A for adenine, G for guanine, C for cytosine and T for thymine. The nucleotides are therefore often simply called bases. A and T are complementary, as are G and C. In recent years, sequence data in DNA databases have grown rapidly. It is difficult to obtain information directly from the primary sequences, so the mathematical analysis of this large volume of sequence data has become one of the challenges for bioscientists. Graphical approaches to biological problems can provide an intuitive picture and useful insights that help in analyzing the complicated relations in these systems. Graphical representation of DNA sequences was first proposed by Hamori and Ruskin [1]. Gates [2], Nandy [3], and Leong and Morgenthaler [4] developed 2D graphical representations of DNA sequences. As for 3D graphical representations, Hamori and Ruskin [1] developed the H-curve, and Zhang [5,6] later created the Z-curve to represent DNA sequences in a 3D space. According to the dimension of the space in which the sequences are plotted, all the graphical representations can be classified into five categories ranging from 2D to 6D [7].
⇑ Corresponding author. Tel./fax: +98 2182883493. E-mail address: [email protected] (A. Iranmanesh).
0025-5564/$ - see front matter © 2012 Elsevier Inc. All rights reserved. http://dx.doi.org/10.1016/j.mbs.2012.11.009
In general, many advances in 2D, 3D and 4D representations of DNA sequences appeared after the initial works [8–18], and many further papers on DNA sequences have been published since; see, for example, [19–29]. Motivated by these works, we proposed a novel 2D graphical representation based on codons [30]. A codon is a specific sequence of three adjacent nucleotides on the mRNA that specifies the genetic code information for synthesizing a particular amino acid; see the codon table in Fig. 1. Compared with individual nucleotides and dinucleotides, codons have more advantages in sequence analysis [31]. For example, the genetic code consists of mRNA codons, and mutations are linked to codon changes. Building on our previous work [30], in this paper we propose a novel 3D graphical representation based on codons, which we call the C-curve, without loss of information. Its advantage over the previous 2D graphical representation is that the C-curve contains no loops, because the value of i is unique for each point, whereas in the 2D representation the coordinates of two consecutive points may be equal. In addition, this 3D representation neither overlaps nor crosses itself, since the value of i in each point (x, y, i) is unique and increasing, whereas in the 2D representation overlapping and crossing can occur and some information may be lost. The application of the new 3D representation to the quantitative analysis of similarities/dissimilarities among the exons of eleven species shows a further advantage. Compared with other models, the C-curve has the following merits: (a) The model allows direct inspection of the compositions and distributions of codons in DNA sequences. (b) From this representation, some alternative parameters can be deduced, which can be used to describe global and local information of DNA sequences.
N. Jafarzadeh, A. Iranmanesh / Mathematical Biosciences 241 (2013) 217–224
Figure 1. Codon table.
(c) Based on the C-curve, a simple approach is outlined for analyzing the similarities/dissimilarities of DNA sequences among different species by numerical representation. In [30], we distributed the 64 kinds of codons in Cartesian 2D coordinates as shown in Fig. 2. In this paper, using Fig. 2, we first give a new 3D graphical representation (C-curve) of DNA sequences based on codons. We then transform the graphical representation into two matrices to facilitate the quantitative comparison of DNA sequences, which gives a numerical representation of DNA sequences based on the C-curve.
In bioinformatics, there are several methods for comparing genetic sequences. The most popular tools are alignment methods. A sequence alignment is a way of arranging DNA, RNA or protein sequences to identify regions of similarity that may be a consequence of functional, structural or evolutionary relationships between the sequences. Alignment-free sequence comparison is frequently used to compare genomic sequences and, in particular, gene regulatory regions. Gene regulatory regions are generally not highly conserved, which makes alignment-based methods less efficient for identifying them [32]. Alignment-free sequence comparison has a relatively long history, starting in the mid-1980s [33]; see, for example, the review in [34]. Most alignment-based methods take more computational time than alignment-free methods. Another advantage of alignment-free methods is their sensitivity against short or partial sequences [35]. The method in this paper is an alignment-free method based on codons. Compared with our previous work [30] and with other alignment-free methods such as [25,27,28], it takes less computational time: although the matrices in this paper are large, calculating the distance between two matrices is faster than calculating some descriptors. The method is also useful for studying mutations in genomes.
Table 1 Values of corresponding parameters of C-curve of sequence "S".

Codon  x    y    i    z     x′    y′    z′
TGT    1    -2   1    -2    1     -2    -2
GCT    -4   2    2    -8    -3    0     -10
GAG    -2   4    3    -8    -5    4     -18
TGA    2    -2   4    -4    -3    2     -22
GGT    -4   3    5    -12   -7    5     -34
CAG    -2   -1   6    2     -9    4     -32
AAA    3    3    7    9     -6    7     -23
CTT    -1   -3   8    3     -7    4     -20
Figure 2. Sixty-four kinds of codons distributed in Cartesian 2D coordinates.
Figure 3. C-curve of sequence ‘‘S’’.
Figure 4. 3D-curve of sequence "S" based on (x′, y′, i).
Table 2 The 64 kinds of codons classified into two groups in three ways by x, y and z.

Classified by x
Weak H-bond: AGC AGG AAG AAC AGT AGA AAA AAT ACT ACA ATA ATT ACC ACG ATG ATC TGC TGG TAG TAC TGT TGA TAA TAT TCT TCA TTA TTT TCC TCG TTG TTC
Strong H-bond: GGC GGG GAG GAC GGT GGA GAA GAT GCT GCA GTA GTT GCC GCG GTG GTC CGC CGG CAG CAC CGT CGA CAA CAT CCT CCA CTA CTT CCC CCG CTG CTC

Classified by y
Purine: AGC AGG AAG AAC AGT AGA AAA AAT ACT ACA ATA ATT ACC ACG ATG ATC GGC GGG GAG GAC GGT GGA GAA GAT GCT GCA GTA GTT GCC GCG GTG GTC
Pyrimidine: CGC CGG CAG CAC CGT CGA CAA CAT CCT CCA CTA CTT CCC CCG CTG CTC TGC TGG TAG TAC TGT TGA TAA TAT TCT TCA TTA TTT TCC TCG TTG TTC

Classified by z
Amino: AGC AGG AAG AAC AGT AGA AAA AAT ACT ACA ATA ATT ACC ACG ATG ATC CGC CGG CAG CAC CGT CGA CAA CAT CCT CCA CTA CTT CCC CCG CTG CTC
Keto: GGC GGG GAG GAC GGT GGA GAA GAT GCT GCA GTA GTT GCC GCG GTG GTC TGC TGG TAG TAC TGT TGA TAA TAT TCT TCA TTA TTT TCC TCG TTG TTC
Table 3 The coding sequences of the first exon of β-globin gene of eleven different species.

Species     Coding sequence
Human       ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAG
Chimpanzee  ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGGCCCTGGGCAGGTTGGTATCAAGG
Mouse       ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTCTCTTGCCTGTGGGGAAAGGTGAACTCCGATGAAGTTGGTGGTGAGGCCCTGGGCAG
Rat         ATGGTGCACCTAACTGATGCTGAGAAGGCTACTGTTAGTGGCCTGTGGGGAAAGGTGAACCCTGATAATGTTGGCGCTGAGGCCCTGGGC
Gallus      ATGGTGCACTGGACTGCTGAGGAGAAGCAGCTCATCACCGGCCTCTGGGGCAAGGTCAATGTGGCCGAATGTGGGGCCGAAGCCCTGGCC
Gorilla     ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGG
Bovine      ATGCTGACTGCTGAGGAGAAGGCTGCCGTCACCGCCTTTTGGGGCAAGGTGAAAGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAG
Goat        ATGCTGACTGCTGAGGAGAAGGCTGCCGTGACCGGCTTCTGGGGCAAGGTGAAAGTGGATGAAGTTGGTGCTGAGGCCCTGGGCAG
Rabbit      ATGGTGCATCTGTCCAGTGAGGAGAAGTCTGCGGTCACTGCCCTGTGGGGCAAGGTGAATGTGGAAGAAGTTGGTGGTGAGGCCCTGGGC
Opossum     ATGGTGCACTTGACTTCTGAGGAGAAGAACTGCATCACTACCATCTGGTCTAAGGTGCAGGTTGACCAGACTGGTGGTGAGGCCCTTGGCAG
Lemur       ATGACTTTGCTGAGTGCTGAGGAGAATGCTCATGTCACCTCTCTGTGGGGCAAGGTGGATGTAGAGAAAGTTGGTGGCGAGGCCTTGGGCAG
2. Construction of C-curve In this section, we give the details of the 3D directed graphical representation of DNA based on codons. Let S = C1 C2 . . . CN be the mRNA sequence, with N codons, transcribed from a DNA sequence. To convert S into a 3D curve we use the plot set u(S) = u(C1) u(C2) . . . u(CN), where u(Ci) = (xi, yi, i), (xi, yi) are the 2D coordinates of codon Ci as introduced in Fig. 2, and i = 1, 2, 3, . . . , N. The curve composed of all the dots of u is the novel graphical representation (for convenience, we call it the C-curve in this paper, i.e., the curve based on codons). For example, take S = TGTGCTGAGTGAGGTCAGAAACTT; here we have eight codons: C1 = TGT, C2 = GCT, C3 = GAG, C4 = TGA, C5 = GGT, C6 = CAG, C7 = AAA, C8 = CTT.
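The construction can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `split_codons` and `c_curve_points` are hypothetical, and because the full Fig. 2 layout is not reproduced in the text, the lookup `EXAMPLE_COORDS` covers only the eight codons of the example sequence S. Its magnitudes follow Table 1, and its signs are assumed from the sign rules stated for Fig. 2 (x > 0 when the first base is A or T, y > 0 when the first base is A or G).

```python
# Hypothetical partial lookup: (x, y) for the eight codons of the example
# sequence S only. Magnitudes follow Table 1; signs are an assumption based
# on the rules "x > 0 for first base A/T" and "y > 0 for first base A/G".
EXAMPLE_COORDS = {
    "TGT": (1, -2), "GCT": (-4, 2), "GAG": (-2, 4), "TGA": (2, -2),
    "GGT": (-4, 3), "CAG": (-2, -1), "AAA": (3, 3), "CTT": (-1, -3),
}


def split_codons(seq):
    """Split a sequence into consecutive, non-overlapping triplets.

    Any incomplete triplet at the end of the sequence is dropped.
    """
    usable = len(seq) - len(seq) % 3
    return [seq[k:k + 3] for k in range(0, usable, 3)]


def c_curve_points(seq, coords):
    """Map the i-th codon C_i to the 3D point (x_i, y_i, i), with i from 1."""
    return [(*coords[codon], i)
            for i, codon in enumerate(split_codons(seq), start=1)]


S = "TGTGCTGAGTGAGGTCAGAAACTT"
points = c_curve_points(S, EXAMPLE_COORDS)
for point in points:
    print(point)
```

Because the third coordinate is the codon index, it increases strictly along the curve, which is why two points can never coincide even when consecutive codons share the same (x, y).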
Figure 5. C-curve of the first exon of β-globin gene of eleven different species.
In Table 1, the corresponding values of x, y and i for sequence "S" are listed, and Fig. 3 presents its C-curve. In the following, we discuss some important properties of the C-curve. Discussion 1. According to Fig. 2, if x > 0, the first base of the given codon must be A or T; otherwise it is G or C. A similar result holds for y: if y > 0, the first base is A or G; otherwise it is C or T. Each codon is represented by the combination of x and y, so one can obtain more information from these parameters. Letting $x'_i = \sum_{n=1}^{i} x_n$ and $y'_i = \sum_{n=1}^{i} y_n$, we obtain the cumulative effect of the given sequence and can inspect both the local and the overall information of the DNA sequence. Table 1 and Fig. 4 show the corresponding results based on (x′, y′, i). Discussion 2. The four DNA bases (A, G, T and C) can be classified in the following three ways according to their chemical properties: (i) chemical structure of the bases: purines = A, G / pyrimidines = T, C; (ii) functional groups of the bases: amino = A, C / keto = G, T; (iii) strength of the hydrogen bonds between paired bases: strong H-bonds = G, C / weak H-bonds = A, T. According to Discussion 1 and Fig. 2, when x > 0 the first base of the codon is A or T, while when x < 0 the first base is G or C; hence x divides the 64 kinds of codons into two groups, the weak H-bond and strong H-bond groups. Similarly, y > 0 means that the first base of the codon is A or G, while y < 0 means that it is C or T, so y also divides the 64 kinds of codons into two groups, the purine and pyrimidine groups. Now we define $z = x \cdot y$ and $z'_i = \sum_{n=1}^{i} z_n$, i = 1, 2, . . . , N. The values of z and z′ for the above example are also presented in Table 1. When z > 0, x and y have the same sign, so the first base of the codon is A or C, while when z < 0 the first base is G or T; hence z divides the 64 kinds of codons into two groups, the amino and keto groups.
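The cumulative parameters and the three sign-based classifications can be checked on the example sequence with a short Python sketch. It is illustrative only: the function name `classify` is hypothetical, and the (x, y) values are those of Table 1 with signs assumed from the sign rules of Discussion 1, since the full Fig. 2 layout is not available in the text.

```python
from itertools import accumulate

# Codons of the example sequence S and their (x, y) coordinates.
# Signs are an assumption derived from the rules in Discussion 1.
codons = ["TGT", "GCT", "GAG", "TGA", "GGT", "CAG", "AAA", "CTT"]
xy = [(1, -2), (-4, 2), (-2, 4), (2, -2), (-4, 3), (-2, -1), (3, 3), (-1, -3)]

x = [a for a, _ in xy]
y = [b for _, b in xy]
z = [a * b for a, b in xy]           # z = x * y

# Cumulative sums: x'_i = x_1 + ... + x_i, and likewise for y' and z'.
x_cum = list(accumulate(x))
y_cum = list(accumulate(y))
z_cum = list(accumulate(z))


def classify(xv, yv):
    """Return the three group labels implied by the signs of x, y and z."""
    zv = xv * yv
    return (
        "weak H-bond" if xv > 0 else "strong H-bond",
        "purine" if yv > 0 else "pyrimidine",
        "amino" if zv > 0 else "keto",
    )


for codon, (xv, yv) in zip(codons, xy):
    print(codon, classify(xv, yv))
print("x':", x_cum)
print("y':", y_cum)
print("z':", z_cum)
```

For every codon in the example, the labels returned by `classify` agree with the chemical class of its first base, which is the property that Discussion 2 relies on.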
In conclusion, the 64 kinds of codons can be classified into two groups in three ways by x, y and z, respectively, as presented in Table 2. Therefore, we can obtain detailed information about codons from x, y and z, while x′, y′ and z′ embody their cumulative effects, respectively, and can be used to exhibit the local and global structures. According to the construction of the C-curve, an arbitrary DNA sequence can be converted into a unique 3D curve based on (x, y, i) that contains no loops, which avoids the loss of information due to overlapping. Besides, one can also obtain other forms of curves based on the alternative invariants inferred from the C-curve, such as the plot set (x′, y′, z′). Next, we obtain the C-curves of the eleven species of Table 3 and discuss the similarities and dissimilarities between these genes by comparing the coding sequences of Table 3 on the basis of their C-curves. Observing Figs. 5 and 6, we note that the C-curves of Human and Chimpanzee are similar, as are those of Mouse and Rat, Gorilla and Chimpanzee, and Bovine and Goat, while the C-curve of Gallus is very dissimilar to the others. For very long DNA sequences, some of the visual advantages may fade and the comparison becomes complex and time consuming. To see the similarity/dissimilarity better, we therefore also obtain a numerical characterization, using the same method as in [28].
3. Numerical characterization of DNA sequences Now we construct two matrices from the C-curve of a given DNA sequence to facilitate the quantitative analysis of similarities/dissimilarities of DNA sequences based on codons. First, let $K_{abc}$ denote the total number of times the codon abc appears in the given sequence, where a, b, c ∈ {A, C, T, G}. The vertex (dot) $V_1$ denotes the first dot of the C-curve, and the vertex $V_i$ denotes the ith dot of the C-curve. Let $d^{abc}$ denote the sum of the geometrical lengths of the edges between the vertices $V_1$ and $V_{abc}$ of the C-curve, where $V_{abc}$ denotes the vertex representing the codon abc appearing in the given sequence. The parameter $p^{abc}$ is defined as the distribution of the frequency of codon abc. For the C-curve, after simple computation, we obtain $d^{abc} = \sum_{k=1}^{K_{abc}} d_k^{abc} / K_{abc}$ and $p^{abc} = K_{abc}/N$, where $d_k^{abc}$ denotes the sum of the geometrical lengths of the edges between the vertices $V_1$ and $V_{abc}$ of the C-curve when the codon abc appears for the kth time in the given
Figure 6. C-curve of the first exon of β-globin gene of eleven different species in one graph.
Table 4 The s-Matrix of the eleven different species presented in Table 3. Each 8 × 8 matrix is listed row by row; rows are separated by semicolons.

H: 45/2 0 39/3 0 0 0 25/2 19; 49/2 0 22 21 0 0 0 0; 0 0 0 34/2 16/2 0 0 0; 50/3 0 39/3 0 0 0 0 0; 0 0 0 2 0 15 0 0; 0 0 0 0 0 0 0 0; 5 0 0 0 9 0 0 0; 0 0 45/3 0 0 0 0 0
Chimpanzee: 43/2 29 39/3 0 0 0 59/3 19; 81/3 0 22 21 0 0 0 0; 0 0 0 65/3 16/2 0 0 0; 23/3 0 39/3 0 0 0 0 33; 0 0 30 2 0 15 0 0; 0 0 0 0 0 0 0 0; 33/2 0 0 0 9 0 0 0; 0 0 17/2 0 0 0 0 0
Goat: 52/3 0 33/3 0 0 0 21/2 0; 22 0 20 19 0 0 17 0; 33/3 0 0 21 2 0 0 0; 33/2 0 43/3 0 10 0 0 0; 0 0 0 0 0 13 0 0; 0 0 0 0 0 0 0 0; 0 0 0 0 0 0 0 0; 0 0 27/2 0 0 0 0 12
Bovine: 41/2 0 33/3 0 0 0 21/2 0; 45/2 0 20 19 0 0 17 0; 5 0 0 21 2 0 0 0; 44/3 0 34 9 10 0 0 0; 0 0 0 0 0 13 0 0; 0 0 0 0 0 0 0 0; 0 0 0 0 0 0 0 12; 0 0 27/2 0 0 0 0 0
Rat: 66/3 0 33/2 0 0 0 25/2 19; 0 16 0 26/2 12 0 0 22; 40/3 0 0 34/2 14/2 0 0 0; 27 0 19/2 0 0 0 0 0; 0 0 0 2 0 15 0 0; 0 0 0 0 0 0 0 0; 20 0 3 0 0 0 0 0; 0 0 42/2 0 0 0 0 0
Mouse: 29 0 33/2 0 0 0 25/2 19; 49/2 16 22 13 0 0 0 0; 25/3 0 0 23 4 0 0 0; 27 0 19/2 11 0 0 0 0; 0 0 0 2 13 15 0 0; 0 0 0 0 0 0 0 0; 0 0 0 0 12 0 0 0; 0 0 15 0 20 0 0 0
Gallus: 29/2 24 13/2 0 0 0 25/2 0; 0 0 48/2 0 0 0 0 19; 5 0 0 0 4 0 0 0; 82/4 0 21/2 18 12 0 0 11; 0 0 9 2 0 18/2 0 0; 0 0 0 0 23 0 0 0; 0 0 0 0 0 0 0 0; 0 0 28 24/2 0 0 0 0
Gorilla: 45/2 0 39/3 0 0 30 25/2 19; 49/2 0 22 21 0 0 0 0; 0 0 0 34/2 16/2 0 0 0; 50/3 0 39/3 0 0 0 0 0; 0 0 0 2 0 15 0 0; 0 0 0 0 0 0 0 0; 5 0 0 0 9 0 0 0; 0 0 45/3 0 0 0 0 0
Rabbit: 45/2 0 39/3 0 0 0 25/2 0; 49/2 0 43/2 0 5 0 0 19; 0 0 0 23 12 0 0 0; 40/2 10 39/3 11 0 0 0 0; 0 0 0 0 0 15 0 0; 0 0 0 2 0 0 0 0; 0 0 0 0 9 0 0 0; 0 0 35/3 0 4 0 0 0
Opossum: 29 0 39/3 21 0 0 25/2 9; 49/2 0 0 0 0 0 0 0; 0 0 0 20 39/3 0 0 0; 27 0 19/2 0 13 0 0 25/2; 0 0 41/2 2 10 15 0 0; 0 0 0 0 0 0 0 0; 0 0 0 28 21/2 0 0 0; 0 0 0 0 0 0 3 0
Lemur: 70/3 0 60/4 0 0 0 17 0; 24 0 0 19 4 0 22 8; 14/2 0 20 23 1 0 0 0; 27 0 18 11 12 0 0 0; 0 0 0 0 0 15 0 0; 0 0 0 10 0 0 0 0; 0 0 0 0 13 0 0 0; 0 0 17/2 0 0 0 30/2 0

Table 5 The p-Matrix of the eleven different species presented in Table 3. Each 8 × 8 matrix is listed row by row; rows are separated by semicolons.

H: 2/30 0 3/30 0 0 0 2/30 1/30; 2/30 0 1/30 1/30 0 0 0 0; 0 0 0 0 2/30 2/30 0 0; 3/30 0 3/30 0 0 0 1/30 0; 0 0 0 1/30 0 1/30 0 0; 0 0 0 0 0 0 0 0; 1/30 0 0 0 1/30 0 0 0; 0 0 3/30 0 0 0 0 0
Chimpanzee: 2/35 1/35 3/35 0 0 0 3/35 1/35; 3/35 0 1/35 1/35 0 0 0 0; 0 0 0 3/35 2/35 0 0 0; 2/35 0 3/35 0 0 0 1/35 1/35; 0 0 1/35 1/35 0 1/35 0 0; 0 0 0 0 0 0 0 0; 2/35 0 0 0 1/35 0 0 0; 0 0 2/35 0 0 0 0 0
Goat: 3/27 0 3/27 0 0 0 2/27 0; 1/27 0 1/27 1/27 0 0 1/27 0; 3/27 0 0 1/27 1/27 0 0 0; 2/27 0 3/27 0 1/27 0 1/27 0; 0 0 0 0 0 1/27 0 0; 0 0 0 0 0 0 0 0; 0 0 0 0 0 0 0 0; 0 0 2/27 0 0 0 0 1/27
Bovine: 2/28 0 3/28 0 0 0 2/28 0; 2/28 0 1/28 1/28 0 0 1/28 0; 2/28 0 0 1/28 1/28 0 0 0; 3/28 0 2/28 1/28 1/28 0 1/28 0; 0 0 0 0 0 1/28 0 0; 0 0 0 0 0 0 0 0; 0 0 0 0 0 0 0 1/28; 0 0 2/28 0 0 0 0 0
Rat: 3/30 0 2/30 0 0 0 2/30 1/30; 0 1/30 0 2/30 1/30 0 0 1/30; 3/30 0 0 2/30 2/30 0 0 0; 1/30 0 2/30 0 0 0 1/30 0; 0 0 0 1/30 0 1/30 0 0; 0 0 0 0 0 0 0 0; 1/30 0 1/30 0 0 0 0 0; 0 0 2/30 0 0 0 0 0
Mouse: 1/30 0 2/30 0 0 0 2/30 1/30; 2/30 1/30 1/30 2/30 0 0 0 0; 3/30 0 0 1/30 1/30 0 0 0; 1/30 0 2/30 1/30 0 0 1/30 0; 0 0 0 1/30 1/30 1/30 0 0; 0 0 0 0 0 0 0 0; 0 0 0 0 1/30 0 0 0; 0 0 3/30 0 1/30 0 0 0
Gallus: 2/30 1/30 2/30 0 0 0 2/30 0; 0 0 2/30 0 0 0 0 1/30; 1/30 0 0 0 1/30 0 0 0; 4/30 0 2/30 1/30 1/30 0 1/30 1/30; 0 0 1/30 1/30 0 2/30 0 0; 0 0 0 0 1/30 0 0 0; 0 0 0 0 0 0 0 0; 0 0 1/30 2/30 0 0 0 0
Gorilla: 2/31 0 3/31 0 0 1/31 2/31 1/31; 2/31 0 1/31 1/31 0 0 0 0; 0 0 0 2/31 2/31 0 0 0; 3/31 0 3/31 0 0 0 1/31 0; 0 0 0 1/31 0 1/31 0 0; 0 0 0 0 0 0 0 0; 1/31 0 0 0 1/31 0 0 0; 0 0 3/31 0 0 0 0 0
Rabbit: 2/30 0 3/30 0 0 0 2/30 0; 2/30 0 2/30 0 1/30 0 0 1/30; 0 0 0 1/30 1/30 0 0 0; 2/30 1/30 3/30 1/30 0 0 1/30 0; 0 0 0 0 0 1/30 0 0; 0 0 0 1/30 0 0 0 0; 0 0 0 0 1/30 0 0 0; 0 0 3/30 0 1/30 0 0 0
Opossum: 1/30 0 3/30 1/30 0 0 2/30 1/30; 2/30 0 0 0 0 0 0 0; 0 0 0 1/30 3/30 0 0 0; 1/30 0 2/30 0 1/30 0 1/30 2/30; 0 0 2/30 1/30 1/30 1/30 0 0; 0 0 0 0 0 0 0 0; 0 0 0 1/30 2/30 0 0 0; 0 0 0 0 0 0 1/30 0
Lemur: 3/30 0 4/30 0 0 0 1/30 0; 1/30 0 0 1/30 1/30 0 1/30 1/30; 2/30 0 1/30 1/30 1/30 0 0 0; 1/30 0 1/30 1/30 1/30 0 1/30 0; 0 0 0 0 0 1/30 0 0; 0 0 0 1/30 0 0 0 0; 0 0 0 0 1/30 0 0 0; 0 0 2/30 0 0 0 2/30 0
sequence. Here, we calculate the space-sum Matrix (s-M) and the distribution Matrix (p-M) as follows:
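$$
s\text{-}M = \begin{bmatrix}
d^{GGC} & d^{GGG} & d^{GAG} & d^{GAC} & d^{AGC} & d^{AGG} & d^{AAG} & d^{AAC} \\
d^{GGT} & d^{GGA} & d^{GAA} & d^{GAT} & d^{AGT} & d^{AGA} & d^{AAA} & d^{AAT} \\
d^{GCT} & d^{GCA} & d^{GTA} & d^{GTT} & d^{ACT} & d^{ACA} & d^{ATA} & d^{ATT} \\
d^{GCC} & d^{GCG} & d^{GTG} & d^{GTC} & d^{ACC} & d^{ACG} & d^{ATG} & d^{ATC} \\
d^{CGC} & d^{CGG} & d^{CAG} & d^{CAC} & d^{TGC} & d^{TGG} & d^{TAG} & d^{TAC} \\
d^{CGT} & d^{CGA} & d^{CAA} & d^{CAT} & d^{TGT} & d^{TGA} & d^{TAA} & d^{TAT} \\
d^{CCT} & d^{CCA} & d^{CTA} & d^{CTT} & d^{TCT} & d^{TCA} & d^{TTA} & d^{TTT} \\
d^{CCC} & d^{CCG} & d^{CTG} & d^{CTC} & d^{TCC} & d^{TCG} & d^{TTG} & d^{TTC}
\end{bmatrix}
$$

$$
p\text{-}M = \begin{bmatrix}
p^{GGC} & p^{GGG} & p^{GAG} & p^{GAC} & p^{AGC} & p^{AGG} & p^{AAG} & p^{AAC} \\
p^{GGT} & p^{GGA} & p^{GAA} & p^{GAT} & p^{AGT} & p^{AGA} & p^{AAA} & p^{AAT} \\
p^{GCT} & p^{GCA} & p^{GTA} & p^{GTT} & p^{ACT} & p^{ACA} & p^{ATA} & p^{ATT} \\
p^{GCC} & p^{GCG} & p^{GTG} & p^{GTC} & p^{ACC} & p^{ACG} & p^{ATG} & p^{ATC} \\
p^{CGC} & p^{CGG} & p^{CAG} & p^{CAC} & p^{TGC} & p^{TGG} & p^{TAG} & p^{TAC} \\
p^{CGT} & p^{CGA} & p^{CAA} & p^{CAT} & p^{TGT} & p^{TGA} & p^{TAA} & p^{TAT} \\
p^{CCT} & p^{CCA} & p^{CTA} & p^{CTT} & p^{TCT} & p^{TCA} & p^{TTA} & p^{TTT} \\
p^{CCC} & p^{CCG} & p^{CTG} & p^{CTC} & p^{TCC} & p^{TCG} & p^{TTG} & p^{TTC}
\end{bmatrix}
$$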
The biological meaning of these two matrices is that they indicate the mean spaces and the distributions of codons in the graph of the given sequences, respectively. Here we regard them as the invariants to numerically characterize the DNA sequences.
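As a rough illustration of how the two matrices can be filled in, the Python sketch below computes them for the Human sequence of Table 3. It is not the authors' code, and it rests on one explicit assumption: every edge of the C-curve is counted with unit length, so the path from V1 to the vertex at position i has length i − 1. Under that assumption the sketch gives 45/2 for GGC, 39/3 for GAG, 25/2 for AAG and 19 for AAC, the same values as the first row of the Human s-Matrix in Table 4; a true Euclidean edge length would require the Fig. 2 coordinates, which the text does not list. The function name `s_and_p_matrices` is hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

# Codon layout of the 8x8 matrices s-M and p-M, row by row.
LAYOUT = [row.split() for row in (
    "GGC GGG GAG GAC AGC AGG AAG AAC",
    "GGT GGA GAA GAT AGT AGA AAA AAT",
    "GCT GCA GTA GTT ACT ACA ATA ATT",
    "GCC GCG GTG GTC ACC ACG ATG ATC",
    "CGC CGG CAG CAC TGC TGG TAG TAC",
    "CGT CGA CAA CAT TGT TGA TAA TAT",
    "CCT CCA CTA CTT TCT TCA TTA TTT",
    "CCC CCG CTG CTC TCC TCG TTG TTC",
)]


def s_and_p_matrices(seq):
    """Return (s-M, p-M) as 8x8 lists of Fractions for a coding sequence."""
    codons = [seq[k:k + 3] for k in range(0, len(seq) - len(seq) % 3, 3)]
    n = len(codons)
    # d_k for every occurrence of a codon. ASSUMPTION: unit edge length,
    # so the path from V_1 to the vertex at position i has length i - 1.
    path_lengths = defaultdict(list)
    for position, codon in enumerate(codons, start=1):
        path_lengths[codon].append(position - 1)
    s_m = [[Fraction(sum(path_lengths[c]), len(path_lengths[c]))
            if c in path_lengths else Fraction(0) for c in row]
           for row in LAYOUT]
    p_m = [[Fraction(len(path_lengths.get(c, [])), n) for c in row]
           for row in LAYOUT]
    return s_m, p_m


# Human sequence from Table 3 (first exon of the beta-globin gene).
HUMAN = ("ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGT"
         "GGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAG")

s_m, p_m = s_and_p_matrices(HUMAN)
print([str(v) for v in s_m[0]])
print([str(v) for v in p_m[0]])
```

`Fraction` keeps the entries exact, although it prints them in lowest terms (for example 13 rather than 39/3).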
Table 6 The similarity/dissimilarity matrix for the sequences of Table 3 based on s-Matrix.

Species    Human      Chimpanzee Goat       Bovine     Gallus     Mouse      Rat        Gorilla    Rabbit     Opossum    Lemur
Human      0          56.266     35.128     40.462     66.741     36.592     51.889     30         39.146     59.598     52.737
Chimpanzee            0          66.976     69.713     74.702     69.130     76.811     63.764     69.094     68.948     77.764
Goat                             0          28.379     64.870     48.575     58.372     46.195     43.492     63.419     44.044
Bovine                                      0          67.342     52.907     64.605     50.371     45.924     67.524     45.449
Gallus                                                 0          72.158     67.044     73.174     59.279     79.954     75.889
Mouse                                                             0          55.769     47.317     44.436     62.589     57.655
Rat                                                                          0          59.946     55.146     71.253     62.529
Gorilla                                                                                 0          49.320     66.723     60.673
Rabbit                                                                                             0          59.208     51.705
Opossum                                                                                                       0          63.928
Lemur                                                                                                                    0
Table 7 The similarity/dissimilarity matrix for the sequences of Table 3 based on p-Matrix.

Species    Human      Chimpanzee Goat       Bovine     Gallus     Mouse      Rat        Gorilla    Rabbit     Opossum    Lemur
Human      0          0.092      0.165      0.221      0.194      0.163      0.216      0.033      0.124      0.194      0.194
Chimpanzee            0          0.187      0.213      0.203      0.171      0.204      0.095      0.174      0.241      0.161
Goat                             0          0.195      0.195      0.161      0.181      0.166      0.151      0.161      0.203
Bovine                                      0          0.217      0.223      0.200      0.221      0.233      0.228      0.228
Gallus                                                 0          0.221      0.235      0.193      0.188      0.226      0.221
Mouse                                                             0          0.188      0.162      0.170      0.221      0.182
Rat                                                                          0          0.214      0.226      0.230      0.188
Gorilla                                                                                 0          0.126      0.193      0.193
Rabbit                                                                                             0          0.216      0.176
Opossum                                                                                                       0          0.216
Lemur                                                                                                                    0
4. Similarities/dissimilarities among the complete coding sequences of β-globin gene of different species In this section, to illustrate the utility of this 3D graphical representation of DNA sequences based on codons, we consider the similarities and dissimilarities among the eleven exons of Table 3. Following the method described in Section 3, we obtain the s-Matrix (Table 4) and the p-Matrix (Table 5) of each gene. In other words, we construct two 64-component vectors for each sequence: the s-vector, consisting of the 64 space-sums in the matrix s-M, and the p-vector, consisting of the 64 distributions in the matrix p-M. The similarities among such vectors can be computed in two ways: 1. calculating the Euclidean distance between the end points of the s-vectors; 2. calculating the Euclidean distance between the end points of the p-vectors. We list the similarities and dissimilarities for the eleven complete coding sequences in Tables 6 and 7. Observing these tables, we note that Goat and Bovine, Human and Gorilla, Human and Goat, Human and Chimpanzee, and Gorilla and Chimpanzee are similar. On the other hand, Gallus (the only non-mammal among them) and Opossum (the most remote species from the remaining mammals) are the most dissimilar to the others among the 11 species. This is analogous to the results reported by other authors [36]. We close with a remark on the s-Matrix. Remark. The s-Matrix uses $d_k^{abc}$, the sum of the geometrical lengths of the edges between the vertices $V_1$ and $V_{abc}$, which depends on the position of each codon. As the number of codons in a DNA sequence increases, $d_k^{abc}$ therefore increases considerably. When two DNA sequences have a similar number of codons, the s-Matrix gives a good similarity/dissimilarity comparison, but for DNA sequences with different numbers of codons it cannot give a proper comparison. As Table 3 shows, the eleven species do not have the same number of codons: Chimpanzee has 35 codons, Goat 27, Bovine 27 and Gorilla 31, while the other species have 30. As a result, the comparison based on the s-Matrix does not give a proper result. For example, in Table 6 all the values for Chimpanzee are rather large, because Chimpanzee has the largest number of codons.
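The two distance computations can be sketched in Python for one pair of species. The sketch is illustrative rather than the authors' implementation: the function names are hypothetical, the 64 codons are taken in an arbitrary fixed order (the Euclidean distance does not depend on that order as long as both vectors share it), and the s-vector again assumes unit edge length, i.e. the distance of the codon at position i from V1 is i − 1. With these assumptions the Human and Gorilla sequences of Table 3 give 0.033 for the p-vectors and 30 for the s-vectors, in agreement with the corresponding entries of Tables 7 and 6.

```python
import math
from collections import Counter
from itertools import product

# All 64 codons in one fixed order shared by every vector.
CODONS = ["".join(t) for t in product("ACGT", repeat=3)]


def split_codons(seq):
    """Consecutive non-overlapping triplets; an incomplete tail is dropped."""
    return [seq[k:k + 3] for k in range(0, len(seq) - len(seq) % 3, 3)]


def p_vector(seq):
    """Relative frequency K_abc / N of each of the 64 codons."""
    codons = split_codons(seq)
    counts = Counter(codons)
    return [counts[c] / len(codons) for c in CODONS]


def s_vector(seq):
    """Mean distance from V_1 per codon (ASSUMPTION: unit edge length)."""
    codons = split_codons(seq)
    counts = Counter(codons)
    totals = Counter()
    for position, codon in enumerate(codons, start=1):
        totals[codon] += position - 1
    return [totals[c] / counts[c] if counts[c] else 0.0 for c in CODONS]


def euclidean(u, v):
    """Euclidean distance between the end points of two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Sequences from Table 3; the Gorilla exon equals the Human exon plus one G.
HUMAN = ("ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGT"
         "GGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAG")
GORILLA = HUMAN + "G"

p_dist = euclidean(p_vector(HUMAN), p_vector(GORILLA))
s_dist = euclidean(s_vector(HUMAN), s_vector(GORILLA))
print(round(p_dist, 3))   # 0.033
print(s_dist)             # 30.0
```

The example also makes the Remark concrete: the single extra codon AGG at position 31 changes the s-vector distance by its full positional value of 30, while the p-vector distance stays small.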
5. Conclusion Building on our previous work [30], in this paper we have proposed a novel 3D graphical representation based on codons, which we call the C-curve, without loss of information. Comparing the C-curves of the first exon of the β-globin gene from eleven species illustrates the utility of this representation.

Acknowledgements This work was partially supported by the Center of Excellence of Algebraic Hyperstructures and its Applications of Tarbiat Modares University (CEAHA). We are grateful to the anonymous referees for their remarks and suggestions.

References
[1] E. Hamori, J. Ruskin, H curves, a novel method of representation of nucleotide series especially suited for long DNA sequences, J. Biol. Chem. 258 (1983) 1318.
[2] M.A. Gates, A simple way to look at DNA, J. Theor. Biol. 119 (1986) 319.
[3] A. Nandy, A new graphical representation and analysis of DNA sequence structure I. Methodology and application to globin genes, Curr. Sci. 66 (1994) 309.
[4] P.M. Leong, S. Morgenthaler, Random walk and gap plots of DNA sequences, Comput. Appl. Biosci. 11 (1995) 503.
[5] C.T. Zhang, R. Zhang, H.Y. Ou, The Z-curve databases: a graphic representation of genome sequence, Bioinformatics 19 (2003) 593.
[6] R. Zhang, C.T. Zhang, Z curve, an intuitive tool for visualizing and analyzing the DNA sequences, J. Biomol. Struct. Dyn. 11 (1994) 767.
[7] A. Nandy, M. Harle, S.C. Basak, Mathematical descriptors of DNA sequences, ARKIVOC 9 (2006) 211.
[8] B. Liao, T.M. Wang, New 2D graphical representation of DNA sequences, J. Comput. Chem. 25 (2004) 1364.
[9] B. Liao, W. Zhu, Y. Liu, 3D graphical representation of DNA sequence without degeneracy and its applications in constructing phylogenic tree, MATCH Commun. Math. Comput. Chem. 56 (2006) 209.
[10] B. Liao, A 2D graphical representation of DNA sequence, Chem. Phys. Lett. 401 (2005) 196.
[11] M. Randić, M. Vračko, A. Nandy, S.C. Basak, On 3-D graphical representation of DNA primary sequences and their numerical characterization, J. Chem. Inf. Comput. Sci. 40 (2000) 1235–1244.
[12] J.F. Yu, J.H. Wang, X. Sun, Analysis of similarities/dissimilarities of DNA sequences based on a novel graphical representation, MATCH Commun. Math. Comput. Chem. 63 (2010) 493.
[13] M. Randić, A.T. Balaban, On a four-dimensional representation of DNA primary sequences, J. Chem. Inf. Comput. Sci. 43 (2003) 532–539.
[14] E. Hamori, J. Ruskin, H curves, a novel method of representation of nucleotide series especially suited for long DNA sequences, J. Biol. Chem. 258 (1983) 1318.
[15] A. Nandy, Graphical analysis of DNA sequence structure: III. Indications of evolutionary distinctions and characteristics of introns and exons, Curr. Sci. 70 (1996) 661.
[16] A. Nandy, On the uniqueness of quantitative DNA difference descriptors in 2D graphical representation models, Chem. Phys. Lett. 368 (2003) 102.
[17] X. Guo, A. Nandy, Numerical characterization of DNA sequences in a 2-D graphical representation scheme of low degeneracy, Chem. Phys. Lett. 369 (2003) 361.
[18] M. Randić, M. Vračko, On the similarity of DNA primary sequences, J. Chem. Inf. Comput. Sci. 40 (2000) 599.
[19] R. Wu, Q. Hu, R. Li, G. Yue, A novel composition coding method of DNA sequence and its application, MATCH Commun. Math. Comput. Chem. 67 (2012) 269.
[20] X. Zhou, K. Li, M. Goodman, A. Sallam, A novel approach for the classical Ramsey number problem on DNA-based supercomputing, MATCH Commun. Math. Comput. Chem. 66 (2011) 347.
[21] Q. Zhang, B. Wang, On the bounds of DNA coding with H-distance, MATCH Commun. Math. Comput. Chem. 66 (2011) 371.
[22] C. Li, N. Tang, J. Wang, Directed graphs of DNA sequences and their numerical characterization, J. Theor. Biol. 241 (2006) 173.
[23] Y. Zhang, W. Chen, Invariants of DNA sequences based on 2DD-curves, J. Theor. Biol. 242 (2006) 382.
[24] J. Feng, Y. Hu, P. Wan, A. Zhang, W. Zhao, New method for comparing DNA primary sequences based on a discrimination measure, J. Theor. Biol. 266 (2010) 703.
[25] X.Q. Liu, Q. Dai, Z. Xiu, T. Wang, PNN-curve: a new 2D graphical representation of DNA sequences and its application, J. Theor. Biol. 243 (2006) 55.
[26] P. He, D. Li, Y. Zhang, X. Wang, Y. Yao, A 3D graphical representation of protein sequences based on the Gray code, J. Theor. Biol. 304 (2012) 81–87.
[27] J.F. Yu, X. Sun, J.H. Wang, TN curve: a novel 3D graphical representation of DNA sequence based on trinucleotides and its applications, J. Theor. Biol. 261 (2009) 459.
[28] X.Q. Qi, J. Wen, Z.H. Qi, New 3D graphical representation of DNA sequence based on dual nucleotides, J. Theor. Biol. 249 (2007) 681.
[29] G. Xie, Z. Mo, Three 3D graphical representations of DNA primary sequences based on the classifications of DNA bases and their applications, J. Theor. Biol. 269 (2011) 123.
[30] N. Jafarzadeh, A. Iranmanesh, A novel graphical and numerical representation for analyzing DNA sequences based on codons, MATCH Commun. Math. Comput. Chem. 68 (2012) 611.
[31] D.R. Forsdyke, Calculation of folding energies of single-strand nucleic acid sequences: conceptual issues, J. Theor. Biol. 248 (2007) 745.
[32] A. Ivan, M.S. Halfon, S. Sinha, Computational discovery of cis-regulatory modules in Drosophila without prior knowledge of motifs, Genome Biol. 9 (2008) 1.
[33] B.E. Blaisdell, A measure of the similarity of sets of sequences not requiring sequence alignment, Proc. Natl. Acad. Sci. USA 83 (1986) 5155–5159.
[34] S. Vinga, J. Almeida, Alignment-free sequence comparison - a review, Bioinformatics 19 (2003) 513.
[35] L.W. Parfrey, J. Grant, Y.I. Tekle, E. Lasek-Nesselquist, H.G. Morrison, M.L. Sogin, D.J. Patterson, L.A. Katz, Broadly sampled multigene analyses yield a well-resolved eukaryotic tree of life, Syst. Biol. 59 (2010) 518.
[36] P. He, J. Wang, Characteristic sequences for DNA primary sequence, J. Chem. Inf. Comput. Sci. 42 (2002) 1080.