Methods xxx (2014) xxx–xxx
Methods journal homepage: www.elsevier.com/locate/ymeth
Codon-based encoding for DNA sequence analysis
Byeong-Soo Jeong a, A.T.M. Golam Bari a, Mst. Rokeya Reaz a, Seokhee Jeon a, Chae-Gyun Lim a, Ho-Jin Choi b,⇑
a Department of Computer Engineering, Kyung Hee University, 1-Seocheon-dong, Yongin-si 446-701, Gyeonggi-do, Republic of Korea
b Department of Computer Science, Korean Advanced Institute of Science and Technology, 291 Daehak-ro, Yuseong-gu, Daejeon 305-701, Republic of Korea
Article info
Article history: Available online xxxx
Keywords: Encoding; DNA sequence; Codon; Sequence similarity; DNA visualization
Abstract
With the exponential growth of biological sequence data (DNA or protein sequences), DNA sequence analysis has become an essential task for biologists seeking to understand the features, functions, structures, and evolution of species. Encoding DNA sequences is an effective method for extracting features from them. It is commonly used for visualizing DNA sequences and analyzing similarities/dissimilarities between different species or cells. Although many encoding approaches have been proposed for DNA sequence analysis, more accurate approaches are still required. In this paper, we propose a novel encoding approach for measuring the degree of similarity/dissimilarity between different species. Our approach preserves the physicochemical properties, positional information, and codon usage bias of nucleotides. An extensive performance study shows that our approach provides higher accuracy than existing approaches in terms of the degree of similarity.
© 2014 Elsevier Inc. All rights reserved.
⇑ Corresponding author. Fax: +82 423503510. E-mail address: [email protected] (H.-J. Choi).
1. Introduction
DNA is a nucleic acid that contains the genetic instructions used during the development and functioning of all known living organisms. Thus, analyzing DNA sequences composed of four nucleotides (i.e., Adenine, Guanine, Cytosine, and Thymine) is a fundamental starting point for understanding biological functions. However, it is very difficult to obtain biological information directly from large DNA sequences. Owing to this complexity, the mathematical analysis of large volumes of sequence data remains a challenge for bioscientists. Sequence alignment has been the basic strategy for sequence (DNA or protein) analysis in bioinformatics. Although it remains a popular practical method, the exponential growth of sequence data calls for a method that is more intuitive and incurs a relatively low computational cost. For this reason, DNA encoding, i.e., the numerical representation of DNA sequences, has been widely studied in relation to several applications such as graphical visualization [10,15], similarity analyses [11,17,18], PPI prediction [18], and the search for evolutionary relationships between species.
In this paper, we propose a simple but effective encoding approach for measuring the degree of similarity/dissimilarity between different species. Our approach preserves the physicochemical properties, positional information, and codon usage bias of nucleotides. An extensive performance study shows that our approach can provide greater accuracy than existing approaches in terms of the degree of similarity.
2. Related works
Graphical representation of DNA sequences was first proposed by Hamori and Ruskin [6]. Later, many advances in 2D [5,7,15,16], 3D [1,9,14], 4D [12], 5D [11] and 6D [13] representations of DNA sequences were developed. In this type of graphical representation, nucleotides, dinucleotides or trinucleotides are given Cartesian coordinates in several dimensions (i.e., from 2D to 6D). DNA sequences are then mapped onto a set of Cartesian points and plotted. Most of the above methods utilize the chemical structure of each nucleotide. The four nucleotides can be classified in three ways, i.e., purines/pyrimidines (A,G)/(C,T), amino/keto groups (A,C)/(G,T), and strong/weak hydrogen bonds (A,T)/(G,C). Based on these chemical properties, each nucleotide can be represented by 3D or 2D coordinate values that represent each category. For the 2D coordinate, we assign (1,1), (1,1), (1,1), (1,1) to represent A, G, C and T, respectively. In addition, several studies compare DNA sequences based on different numerical characterizations.
http://dx.doi.org/10.1016/j.ymeth.2014.01.016 1046-2023/© 2014 Elsevier Inc. All rights reserved.
Please cite this article in press as: B.-S. Jeong et al., Methods (2014), http://dx.doi.org/10.1016/j.ymeth.2014.01.016
For example, Wu et al. [19] proposed 10 correlation factors consisting of four mononucleotide and six dinucleotide factors. Qi et al. [17] proposed a graph-theory-based representation scheme for DNA sequences. A word-based measure [4] is one of the most widely used alignment-free approaches for sequence comparison, where each sequence is mapped into an n-dimensional vector according to its k-word frequencies/probabilities.
A codon is the three-unit sequence (AUG, AGC, etc.) of mRNA nucleotides that codes for a specific amino acid. Codons have many advantages for preserving genetic information, for example in mutation detection and in the prediction of genes or proteins [10]. Therefore, codon-based feature extraction from DNA sequences is highly informative. Motivated by this characteristic, we design a codon-based encoding scheme for DNA sequences that incurs no loss of information, contains no cycles, and uses a unique mapping process. In our proposed approach, we convert DNA sequences into a 3D curve based on codon information such that the chemical properties of nucleotides are preserved. Our approach satisfies the requirements of uniqueness and preservation of the DNA sequence. It also provides efficient and accurate similarity comparisons between DNA sequences.
Another type of three-dimensional graphical representation, the Z-curve proposed by Zhang et al. [3], is based on the so-called Z-transform of a DNA sequence and the cumulative frequency of nucleotides in that sequence. The Z-curve has several important applications, such as the analysis and comparison of genome sequences and the determination of the G + C content of a sequence.
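As an illustration of the Z-curve idea mentioned above, the following minimal Python sketch accumulates, for each prefix of a sequence, the three differences purine minus pyrimidine, amino minus keto, and weak-H minus strong-H. It follows the commonly cited Z-curve definition and is not taken from the implementation of [3]; the function name `z_curve` is our own choice.

```python
def z_curve(seq):
    """Return the cumulative Z-curve points (x_n, y_n, z_n) of a DNA string.

    x_n = (A_n + G_n) - (C_n + T_n)   purine vs. pyrimidine
    y_n = (A_n + C_n) - (G_n + T_n)   amino vs. keto
    z_n = (A_n + T_n) - (G_n + C_n)   weak vs. strong hydrogen bond
    """
    # Step contributed by one base to (x, y, z).
    steps = {
        "A": (1, 1, 1),
        "G": (1, -1, -1),
        "C": (-1, 1, -1),
        "T": (-1, -1, 1),
    }
    points = []
    x = y = z = 0
    for base in seq.upper():
        dx, dy, dz = steps[base]
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points


if __name__ == "__main__":
    for point in z_curve("ATGC"):
        print(point)
```

A sequence containing each base exactly once returns to the origin, which is the kind of cycle that the encoding proposed in this paper is designed to avoid.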
3. 3D Encoding of DNA sequences
3.1. Chemical structures of nucleotide
Nucleotides are the basic building blocks of DNA. There are four types of nucleotides: Adenine (A), Guanine (G), Cytosine (C) and Thymine (T). Nucleotides are classified into two groups in three ways, i.e., by ring structure (purine/pyrimidine), functional group (amino/keto), and hydrogen bond (strong-H/weak-H). The common aspect of their chemical structures is a heterocyclic hexagonal ring. A and G each have one hexagon and one pentagon, whereas C and T each have only one hexagon. Although the nucleotides share this ring structure, they differ in molecular weight: the weights of A, T, C and G are 135.13, 112.1, 111.1 and 151.13, respectively. We use the nucleotides' hexagonal symmetry as well as their molecular weights to distribute the 64 codons in a two-dimensional Cartesian space.
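The three classifications and the molecular weights above can be written down as a small lookup, which is convenient when the bases of a group have to be ordered by ascending weight. The sketch below is only an illustration: the weights are copied from the text, and the names `classify` and `order_by_weight` are hypothetical helpers, not part of the authors' programs.

```python
# Molecular weights as quoted in Section 3.1.
MOLECULAR_WEIGHT = {"A": 135.13, "T": 112.1, "C": 111.1, "G": 151.13}


def classify(base):
    """Return (ring structure, functional group, hydrogen bond) for one base."""
    return (
        "purine" if base in "AG" else "pyrimidine",
        "amino" if base in "AC" else "keto",
        "weak-H" if base in "AT" else "strong-H",
    )


def order_by_weight(bases):
    """Arrange the given bases in ascending order of molecular weight."""
    return "".join(sorted(bases, key=MOLECULAR_WEIGHT.get))


if __name__ == "__main__":
    for base in "ATCG":
        print(base, classify(base))
    print(order_by_weight("GA"))  # the purine pair in ascending weight
```

For the purine pair, ordering by weight places A before G, which matches the choice of AG over GA described in Section 3.3.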
3.2. Properties of a regular hexagon
In geometry, a hexagon is a polygon with six sides and six angles. A regular hexagon has six rotational symmetries and six reflection symmetries, making up what is known as the dihedral group $D_6$. We use these properties to find the coordinates of each codon in the Cartesian space. The notable properties are summarized below. All sides of a regular hexagon have equal length, and all interior angles are equal. The longest diagonals, which connect diametrically opposite vertices, are twice the length of one side. Fig. 1 shows a regular hexagon and its properties. Some observations from Fig. 1 are as follows: if we assume that the length of each side ($PQ = QR = RS = ST = TU = UP$) is one, then the length of segments $QU$ and $RT$ is $\sqrt{3}$, and the lengths of the segments ($PS = RU = QT$), ($QX = XU = RY = YT$), ($PX = YS$) and ($XY = UT = QR$) are $2$, $\sqrt{3}/2$, $1/2$ and $1$, respectively.
Fig. 1. A regular hexagon.
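The segment lengths listed above can be checked numerically. The sketch below places a unit hexagon at the origin and measures the segments; the placement of the vertices (P on the negative x-axis, labels running clockwise) and the reading of X and Y as the midpoints of QU and RT are assumptions made for this check, since Fig. 1 is not reproduced here.

```python
import math

H = math.sqrt(3) / 2  # height of a unit equilateral triangle

# Assumed placement: centre at the origin, P on the negative x-axis,
# vertices labelled P, Q, R, S, T, U clockwise.
POINTS = {
    "P": (-1.0, 0.0), "Q": (-0.5, H), "R": (0.5, H),
    "S": (1.0, 0.0), "T": (0.5, -H), "U": (-0.5, -H),
}


def midpoint(a, b):
    """Midpoint of the segment joining two named points."""
    (ax, ay), (bx, by) = POINTS[a], POINTS[b]
    return ((ax + bx) / 2, (ay + by) / 2)


POINTS["X"] = midpoint("Q", "U")  # where diagonal PS crosses QU
POINTS["Y"] = midpoint("R", "T")  # where diagonal PS crosses RT


def length(a, b):
    """Euclidean length of the segment joining two named points."""
    (ax, ay), (bx, by) = POINTS[a], POINTS[b]
    return math.hypot(ax - bx, ay - by)


if __name__ == "__main__":
    for name in ("PQ", "QU", "RT", "PS", "QX", "PX", "XY", "UT"):
        print(name, round(length(name[0], name[1]), 4))
```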
3.3. Distribution of 64 codons
The proposed model positions each codon on the surface of a hexagon based on molecular weight and chemical classification. Every group is positioned on both sides of the longest diagonal of the hexagon. The midpoints of each arm are determined by their end points. The other four points reside on the connecting line of the hexagonal end point as shown in Fig. 1(a). The purine group generates AG and GA, but we put AG in position 1 because it retains the ascending order of molecular weight. The 2, 3, 4, 5 and 6 ends of the hexagon follow the same rule as for the purine group. The midpoints of 2–3, 3–4 and 4–5 are determined by the following rule: take the uncommon nucleotides and form a dinucleotide with them. The other three midpoints take the common and absent nucleotides to form a dinucleotide. We put the hexagons on Cartesian 2D coordinates as shown in Fig. 2(b). We distribute each of the $4^3 = 64$ codons in Cartesian 2D coordinates by adding A, T, C and G to the first dinucleotide residing in the first, second, third and fourth quadrant, respectively.
3.4. Proposed encoding
Let $S = \{s_1, s_2, \ldots, s_n\}$ be a DNA sequence where $s_i \in \Sigma = \{A, T, C, G\}$ and $i = 1, 2, 3, \ldots, n$. $S$ is mapped into a series of points $P_1, P_2, \ldots, P_{n-2}$. We introduce a map function $\varphi$ such that $S$ can be formulated as $S = \varphi(s_i, s_{i+1}, s_{i+2})\,\varphi(s_{i+1}, s_{i+2}, s_{i+3}) \ldots \varphi(s_{n-2}, s_{n-1}, s_n)$, where $\varphi(s_i, s_{i+1}, s_{i+2})$ is the encoding value of codon $(s_i, s_{i+1}, s_{i+2})$ as in Table 1, and a series of points $P_i = (X_i, Y_i, i)$ where $X_i = \varphi(s_i, s_{i+1}, s_{i+2})$, $Y_i = \varphi(s_i, s_{i+1}, s_{i+2})$, $i = 1, 2, 3, \ldots, n-2$. $X_i$, $Y_i$ and $i$ represent the x-coordinate, y-coordinate and z-coordinate, respectively. Thus, we obtain $n-2$ points from a DNA sequence. To locate the local and global features of the 3D curve as well as to visualize its 3D representation, we use another numerical representation. Let $x_i = \sum_{k=1}^{i} X_k$ and $y_i = \sum_{k=1}^{i} Y_k$; we derive another mapping function for the cumulative feature of the 3D curve such that $\lambda(s_i, s_{i+1}, s_{i+2}) = (x_i, y_i, i)$ where $i = 1, 2, 3, \ldots, n-2$. Connecting the $n-2$ points from the beginning, we obtain the proposed novel 3D zigzag curve.
3.5. Example of the proposed method
To illustrate the proposed method, we consider the example DNA sequence S = ATACGATTCAGTACG. The length of S is 15. Hence, it is converted into a 3D line with 13 points. The 3D coordinates of those 13 points are presented in Table 1.
3.6. Properties of the proposed encoding
Property 1. No circuit or degeneracy in the proposed 3D graphical representation of DNA sequences.
Fig. 2. Distribution of 64 codons on their chemical structure and classification.
Proof. Assume that there are one or more circuits in the graphical representation. Let $P_i$ be the vector of the $i$th point. If $m$ points create a loop, then $P_1 + P_2 + \cdots + P_m = 0 \Rightarrow (x_1, y_1, z_1) + (x_2, y_2, z_2) + \cdots + (x_m, y_m, z_m) = (0, 0, 0)$. However, this contradicts the previous definitions $x_i = \sum_{k=1}^{i} X_k$ and $y_i = \sum_{k=1}^{i} Y_k$; in particular, every z-coordinate is a positive index $i$, so the z-components cannot sum to zero. □
Property 2. For a given DNA sequence, there is only one 3D graphical representation corresponding to it.
Proof. Each codon has unique 3D coordinates. Therefore, there is a unique sequence of $(x, y, z)$ corresponding to a given DNA sequence. Even if there is only a single point mutation, at most three codons change around that point. Therefore, the graphical representation of a mutated sequence mainly changes in the mutated area. □
Property 3. For a 3D graphical representation, there is only one DNA sequence.
Proof. Each data point of a 3D graph corresponds to a codon and each codon represents a unique 3D Cartesian coordinate. Therefore, the conversion from 3D graph to 3D coordinates is one to one. □
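The encoding of Section 3.4 and the three properties can be sketched in Python as follows. This is a minimal sketch under one loud assumption: the real codon coordinates come from the hexagonal layout of Fig. 2, which is not reproduced here, so `CODON_XY` is a hypothetical placeholder that merely gives every codon a distinct $(X, Y)$ pair. The function names are ours.

```python
from itertools import accumulate, product

BASES = "ATCG"

# HYPOTHETICAL placeholder coordinates: one distinct (X, Y) per codon.
# The actual values of the hexagonal model (Fig. 2) are not used here.
CODON_XY = {
    "".join(c): (float(i % 8) - 3.5, float(i // 8) - 3.5)
    for i, c in enumerate(product(BASES, repeat=3))
}
XY_CODON = {xy: codon for codon, xy in CODON_XY.items()}


def codons(seq):
    """Overlapping codons (s_i, s_{i+1}, s_{i+2}) for i = 1 .. n-2."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]


def encode(seq):
    """Points P_i = (X_i, Y_i, i), one per overlapping codon."""
    return [CODON_XY[c] + (i,) for i, c in enumerate(codons(seq), start=1)]


def cumulative_curve(seq):
    """Points (x_i, y_i, i) with x_i and y_i the running sums of X_k and Y_k."""
    pts = encode(seq)
    xs = accumulate(p[0] for p in pts)
    ys = accumulate(p[1] for p in pts)
    return [(x, y, i) for i, (x, y) in enumerate(zip(xs, ys), start=1)]


def decode(points):
    """Recover the sequence from the points P_i (Property 3)."""
    cods = [XY_CODON[(x, y)] for x, y, _ in points]
    return cods[0] + "".join(c[2] for c in cods[1:])


if __name__ == "__main__":
    S = "ATACGATTCAGTACG"
    print(len(encode(S)))           # number of points
    print(decode(encode(S)) == S)   # round trip back to the sequence
```

Because the z-coordinate is the strictly increasing index $i$, no point of the curve can be revisited, and because the placeholder map is one to one, the sequence can be rebuilt from the first codon plus the last base of every following codon.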
Table 1
Example of CODON.
Codon   Point   x      y      z
ATA     P1      1      3.25   1
TAC     P2      3.75   1.66   2
ACG     P3      6.75   2.55   3
CGA     P4      4.25   1.58   4
GAT     P5      5.10   3.65   5
ATT     P6      3.5    5.65   6
TTC     P7      6      3.79   7
TCA     P8      8      2.79   8
CAG     P9      4.5    1.55   9
AGT     P10     7.75   3.23   10
GTA     P11     4.75   4.23   11
TAG     P12     6.5    2.36   12
AGG     P13     9.5    5.09   13
3.7. Graphical representation of the proposed method
In this section, we show the graphical representation of real DNA sequences based on our method. The mtDNA sequences of eight Eutherian mammals are plotted as 3D graphs in Fig. 3. From the graphical representation we can infer that p.chimpa, c.chimpa and gorilla have similar graphs; orangutan and gibbon have similar 3D graphs; and horse and w.rhinoce exhibit similar graphical representations.
Fig. 3. The graphical representation of each species.
3.8. Discussions
The proposed model takes each successive position into account and forms a codon starting at that point. A sequence with $n$ nucleotides will generate $n - 2$ codons. The model can distinguish a mutation at any point. For this reason, the model has a unique representation from sequence to graph, and vice versa. On the other hand, if we do not consider each successive position, there will be $n/3$ codons. There is a loss of information in this case if the sequence length is not fully divisible by 3. Two sequences whose lengths are not fully divisible by 3 will each lose $n \bmod 3$ ($n$ modulo 3) nucleotides. If the sequences differ only at the last one or two positions, they will generate the same graphical representation. This drawback can be found in the model of Jafarzadeh and Iranmanesh [9].
Furthermore, the model is capable of classifying codons by their chemical groups as described in Section 3.1. When $x > 0$, the codons start with A or T; otherwise, they start with G or C. Similarly, $y > 0$ denotes that the first base is A or G; otherwise it is C or T. According to this inspection, we find that $x$ divides the weak-H/strong-H groups and $y$ divides the purine/pyrimidine groups. The hexagonal model is based on the chemical properties of nucleotides that are used to classify them into different chemical groups. It also uses the molecular weight of nucleotides to determine the 3D coordinates of each codon. Furthermore, this model can be extended to six different cycles, where each cycle produces its own graphical and numerical analysis for species [2]. The model therefore compares favorably with other models conceptually, because the methods described in [10,11,15,19] have only numerical significance and lack biological relevance, whereas our proposed method has biological as well as numerical importance. Conceptually, we can design different methods, numerical as well as graphical, to show the evolutionary relationship among different species.
A graphical representation of a DNA sequence based on codons needs to satisfy two properties, symmetry and periodicity, because they provide the harmony between the chosen geometry and the biological reality. Otherwise, it would merely be an instance of displaying nucleotides (e.g., mononucleotides, dinucleotides, codons) with little biological sense. The proposed hexagonal model uses the hexagonal symmetry by positioning two groups (e.g. purine and pyrimidine, amino and keto, strong-H and weak-H) on the vertices of a hexagon in such a way that their connecting line symmetrically bisects the hexagon. In addition, the clockwise or anticlockwise 360-degree rotation of any hexagonal vertex creates six different cycles, as in our previous work [2]. We also used different similarity metrics (e.g. Euclidean, cosine) for numerical analysis and performed phylogenic analysis to show the evolutionary proximity and distance among different species.
It is very important to find the position of a mutation and its type. The hexagonal model works not only as a visualization and numerical tool but also for mutation detection in biological sequences. It can detect any point mutation (single or multi-point): a mutation of $n$ nucleotides produces variation in $(n + 2)$ points of the 3D curve of the sequence. Furthermore, the method preserves as much information as the sequence contains, because it takes the successive position of each nucleotide, shows no loop and avoids degeneracy, as proved in Section 3.6. For example, consider the beta-globin genes of Gorilla and Mouse [2]. The hexagonal model produces 91 points for each of their sequences because we took every possible successive codon position to construct the feature vector. If we use the C-curve [8] method, it will produce only 30 points, which is a definite loss of information. Furthermore, if those two sequences differ only in the last position (say sequence 1 is xxxxy and sequence 2 is xxxxz, where $x, y, z \in \{A, G, T, C\}$), then the C-curve will draw the same curve for both sequences, which is a wrong interpretation.
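The difference between taking every successive position and taking non-overlapping triplets can be made concrete with a short sketch. The two 16-nucleotide test sequences below are hypothetical (the example sequence of Section 3.5 with one extra base appended), and the function names are ours.

```python
def overlapping_codons(seq):
    """One codon per successive position: n - 2 codons for n nucleotides."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]


def non_overlapping_codons(seq):
    """Consecutive triplets only; the last n mod 3 nucleotides are dropped."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


if __name__ == "__main__":
    # Hypothetical pair that differs only at the last position.
    a = "ATACGATTCAGTACGA"
    b = "ATACGATTCAGTACGT"
    print(len(overlapping_codons(a)), len(non_overlapping_codons(a)))
    print(non_overlapping_codons(a) == non_overlapping_codons(b))
    print(overlapping_codons(a) == overlapping_codons(b))
```

With non-overlapping triplets the final base is discarded, so the two sequences become indistinguishable; with overlapping codons the last codon differs and the two curves separate.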
4. Experimental analysis
The model takes publicly available biological sequences (e.g. DNA, RNA, β-globin gene) of different species. It then generates a graphical model to visualize the evolutionary relationship among the species. Furthermore, the method constructs a feature vector for every sequence to compare the species numerically. As a result, this work outputs two types of analysis, graphical and numerical (phylogenic analysis and similarity/dissimilarity calculation, respectively).
To evaluate the effectiveness of our approach, we apply our method to measure similarities/dissimilarities among the complete mtDNA sequences of eight Eutherian mammals, as shown in Table 2. These were also studied in earlier studies [12–14]. Table 2 shows the characteristics of the dataset. Initially, we extract the features from the mtDNA sequences. Next, we show the overall performance of the proposed encoding scheme. We then apply distance metrics, namely the cosine and Euclidean metrics, to measure the similarity/dissimilarity among the different species. A shorter distance means that the two sequences are more similar. We compare the results of our method with the results of the earlier works [11,19], showing that the proposed method is superior. Our programs were written in Python 2.7 and run on the Windows XP operating system on a Pentium dual-core 2.13 GHz CPU with 2 GB of main memory. We use BioPython 1.60 for sequence parsing.
4.1. Feature extraction
We inspect codon usage and RSCU on the above dataset. Then we build a feature extraction method based on those two definitions.
4.1.1. Codon usage
Let the DNA sequence in terms of codons be $s = \{s_1, s_2, s_3, \ldots, s_n\}$, where $s_i \in \Sigma$, $n$ is the length of the sequence in codons, and $\Sigma = \{c_1, c_2, \ldots, c_{64}\}$ is the alphabet of the codons. The codon usage $r_{c_j}$ of codon $c_j$ is measured by the fraction of the $c_j$ codon in sequence $s$ [17]:
$$r_{c_j} = \frac{1}{n} \sum_{i=1}^{n} \delta(s_i = c_j) \qquad (1)$$
Table 2
mtDNA sequences of eight Eutherian mammals.
Species     ID/Accession   Database   Length
Baboon      Y18001         NCBI       16,521
Gibbon      X99256         NCBI       16,472
Orangutan   D38115         NCBI       16,389
Gorilla     D38114         NCBI       16,364
c.chimpa    D38113         NCBI       16,563
p.chimpa    D38116         NCBI       16,554
Horse       X79547         NCBI       16,660
W.rhinoce   Y07726         NCBI       16,832
Table 3
Similarity/dissimilarity comparison based on $d_1$.
            Gibbon   Orangutan   Gorilla   p.chimpa   c.chimpa   Horse    w.rhinoce
Baboon      0.0149   0.0194      0.0117    0.0108     0.0108     0.0232   0.0272
Gibbon               0.0088      0.0115    0.0132     0.0124     0.0299   0.03592
Orangutan                        0.0158    0.0164     0.0158     0.0353   0.0420
Gorilla                                    0.0052     0.0052     0.0213   0.0295
c.chimpa                                              0.0235     0.0305   0.0298
p.chimpa                                                         0.0031   0.0230
Horse                                                                     0.0162
Here $\delta$ is the Kronecker delta: $\delta(\cdot) = 1$ if the argument inside is satisfied, and $0$ otherwise. The codon usage $r_c$ of the given gene sequence $s$ is the vector $r_c = (r_{c_j} : j = 1, 2, \ldots, 64)$. The problem with $r_c$ is that it is large. Therefore, RSCU can be used to form the feature matrix.
4.1.2. RSCU (relative synonymous codon usage)
RSCU was adopted to evaluate synonymous codon usage without the confounding influence of the amino acid compositions of different sequence samples [17]. For a given coding sequence, the RSCU value $r_k$ of synonymous codon $k$ is calculated as:
$$r_k = n_k \, \frac{obs_k}{tot_k} \qquad (2)$$
where $obs_k$ is the observed number of codons $k$, $tot_k$ is the total observed number of codons coding for the amino acid coded by codon $k$, and $n_k$ denotes the number of synonymous codons. The standard genetic code contains 64 codons, but UGG and AUG are the only codons for their amino acids, i.e., tryptophan and methionine, respectively. We discard these two codons because their RSCU value is always unity. UGA, UAA and UAG are also not considered because they are stop codons. Therefore, RSCU alone is not a better choice than simply using the codons to construct our feature vector, as some information would be lost. We therefore take the benefits of both and build our 21-dimensional feature vector.
4.1.3. Proposed feature vector
The 64 codons are classified into 21 groups based on their synonymous codons. Due to code degeneracy, each amino acid corresponds to 1–6 codons. The 21-dimensional feature vector can be defined as:
$$f_{ij} = \frac{C_{ij}}{\sum_{j=1}^{n_i} C_{ij}} \qquad (3)$$
where $C_{ij}$ is the number of occurrences of synonymous codon $j$ of amino acid $i$ and $n_i$ is the number of synonymous codons.
4.2. Numerical analysis of the proposed method
We apply the Euclidean distance ($d_1$) and cosine similarity ($d_2$) to the feature vectors. As an example, if there are two species $s$ and $h$, and we obtain $F_s$ and $F_h$ as their respective feature vectors, we measure $d_1$ and $d_2$ between $F_s$ and $F_h$. Initially, we present the similarity/dissimilarity matrix based on the distance measurement $d_1$, as shown in Table 3. When we examine this table, we notice that notably small entries are associated with the pairs (p.chimpa, c.chimpa) with $d_1 = 0.0031$, and (gibbon, orangutan) with $d_1 = 0.0088$. This indicates that the more similar species pairs are p.chimpa–c.chimpa and gibbon–orangutan. We also observe that the largest entry, $d_1 = 0.0420$, is associated with orangutan and w.rhinoce and that the larger entries appear in the rows belonging to gibbon and w.rhinoce. These observations are consistent with the results reported in previous studies [11,19], as determined by matrix-invariant techniques. Table 4 presents the similarity/dissimilarity matrix based on the distance measurement $d_2$. Notably small entries are again associated with the pairs (p.chimpa, c.chimpa) with $d_2 = 0.0001$ and (gibbon, orangutan) with $d_2 = 0.0005$. We find that the largest entry ($d_2 = 0.0127$) is associated with (orangutan, w.rhinoce) and that the rows corresponding to gibbon and w.rhinoce have larger entries.
4.3. Phylogenic analysis
Further, the similarity matrix can be applied to construct the phylogenic tree for the eight Eutherian mammals. The resulting phylogenic tree reflects the quality of the similarity matrix, which efficiently extracts evolutionary information from DNA sequences with our encoding method. Given the similarity matrix, we can generate the phylogenic tree using the NJ (Neighbor Joining) method in the PHYLIP package. In Fig. 4, we show the phylogenic tree of the mtDNA sequences of eight Eutherian mammals based on $d_1$. We get the same tree when applying $d_2$. Observing this figure, we find that the more similar species pairs are p.chimpa–c.chimpa, gibbon–orangutan, and horse–w.rhinoce.
Fig. 4. The phylogenic tree of eight Eutherian mammals using mtDNA sequences.
Table 4
Similarity/dissimilarity comparison based on $d_2$.
            Gibbon   Orangutan   Gorilla   p.chimpa   c.chimpa   Horse    w.rhinoce
Baboon      0.0016   0.0026      0.0010    0.0008     0.0008     0.0040   0.0055
Gibbon               0.0005      0.0010    0.0013     0.0011     0.0066   0.0096
Orangutan                        0.0017    0.0019     0.0018     0.0089   0.0127
Gorilla                                    0.0002     0.0002     0.0033   0.0065
p.chimpa                                              0.0001     0.0038   0.0065
c.chimpa                                                         0.0040   0.0068
Horse                                                                     0.0020
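The quantities of Sections 4.1 and 4.2 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the program used for the tables: codons are written with T instead of U, the standard genetic code is used as stated in Section 4.1.2, $d_2$ is read as one minus the cosine similarity so that smaller values mean more similar, and all function names are ours.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, bases ordered T, C, A, G; "*" marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {"".join(c): aa for c, aa in zip(product("TCAG", repeat=3), AMINO)}

# Synonymous families: 20 amino acids plus the stop class, 21 groups in total.
FAMILIES = {}
for codon, aa in GENETIC_CODE.items():
    FAMILIES.setdefault(aa, []).append(codon)


def codon_usage(codons):
    """Eq. (1): fraction of each of the 64 codons in the codon list."""
    counts = Counter(codons)
    n = len(codons)
    return {c: counts[c] / n for c in GENETIC_CODE}


def synonymous_fraction(codons):
    """Eq. (3): f_ij = C_ij / sum_j C_ij within each synonymous family."""
    counts = Counter(codons)
    f = {}
    for members in FAMILIES.values():
        tot = sum(counts[c] for c in members)
        for c in members:
            f[c] = counts[c] / tot if tot else 0.0
    return f


def rscu(codons):
    """Eq. (2): r_k = n_k * obs_k / tot_k."""
    f = synonymous_fraction(codons)
    return {c: len(FAMILIES[GENETIC_CODE[c]]) * f[c] for c in GENETIC_CODE}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """One minus the cosine similarity (assumed reading of d_2)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


if __name__ == "__main__":
    sample = ["ATG", "GCC", "GCT", "GCA"]  # hypothetical codon list
    print(codon_usage(sample)["GCC"], rscu(sample)["GCC"])
```

To compare two species, the dictionaries can be turned into vectors with a fixed codon order, for example `[f[c] for c in sorted(GENETIC_CODE)]`, and passed to `euclidean` and `cosine_distance`.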
4.4. Mutation analysis
Different species have different molecular and genetic characters, and different genes control the expression of different biological characters. Some mutational changes in biological characters are due to nucleotide mutations in DNA molecules. Therefore, it is very important to find the position of a mutation and its type. In general, nucleotide mutations can be classified into three basic types: substitution, insertion, and deletion. Based on the plotted curve of our proposed 3D graphical representation, we can provide a way to analyze the mutation in a DNA sequence. We draw the projection curves of the initial sequence and the potentially mutated sequence in the same graph. The two curves will overlap perfectly if no mutation occurs in the potential sequence. However, if the two curves are different, nucleotide mutations must have occurred. In the case of substitution, for example, if we take the original sequence ("ATA[C]GATTCA") and the mutated sequence ("ATA[G]GATTCA"), it is obvious that the substitution occurs at the 4th position, and the 2nd, 3rd, and 4th points of the two sequences are located differently in the 3D space, as in Fig. 5. In the case of deletion, with the original sequence ("ATACGATTCA") and the mutated sequence ("ATA[!]GATTCA"), the points of the two sequences are located differently from the 2nd point onward (Fig. 6). Insertion can be detected in the same way as deletion.
Fig. 5. The case of substitution. [3D plot with axes X, Y and Z; curves: original sequence and mutated sequence.]
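The comparison described above can be sketched without plotting by comparing the overlapping codons of the two sequences position by position, since each codon determines its point. The sketch below uses the example sequences of this section, with the substituted sequence read as ATAGGATTCA; the function names are ours.

```python
def codons(seq):
    """Overlapping codons, one per successive position."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]


def changed_points(original, mutated):
    """1-based indices of points whose codon differs, over the shared length."""
    pairs = zip(codons(original), codons(mutated))
    return [i for i, (a, b) in enumerate(pairs, start=1) if a != b]


if __name__ == "__main__":
    original = "ATACGATTCA"
    substituted = "ATAGGATTCA"  # C -> G at the 4th position
    deleted = "ATAGATTCA"       # the 4th nucleotide (C) removed
    print(changed_points(original, substituted))
    print(changed_points(original, deleted))
```

A substitution disturbs only the three codons that contain the changed position, whereas a deletion shifts every later codon, so all points from the first affected one onward differ.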
Fig. 6. The case of insertion/deletion. [3D plot with axes X, Y and Z; curves: original and mutated.]
4.5. Comparison with other methods
We note that there is overall qualitative agreement between Tables 3 and 4. To observe this visually, we denote the degree of similarity/dissimilarity of the pair p.chimpa–c.chimpa as 1 in each table, after which the degrees of dissimilarity/similarity between p.chimpa and the other species under the two distance measurements are determined, as shown in Fig. 7. We can see that the trends of these two curves are nearly identical, which demonstrates the overall agreement among the dissimilarity/similarity results obtained by these two distance methods.
In our method, cosine similarity provides the best result in terms of the degrees of similarity/dissimilarity. Furthermore, Euclidean similarity supports the consistency and superiority of the proposed method as compared to existing ones. Baboon is an outlier among the species, which is brought out sharply by its high degree of similarity/dissimilarity relative to the other methods shown in Fig. 7. On the other hand, the large difference between c.chimpa–horse and c.chimpa–w.rhinoce shows that the (horse, w.rhinoce) pair is different from c.chimpa. However, w.rhinoce and horse are close in terms of the degree of similarity. The variation in the degree in the proposed curvilinear trends supports the natural consistency as well as the superiority of this method.
Fig. 7. Comparison with other methods by degree of similarity/dissimilarity.
To compare the hexagonal method with existing ones, we choose the works [11,19] because they use the same datasets as we do and their results are promising. However, the proposed model is superior to them in terms of the degree of similarity/dissimilarity among different species. For example, Table 5 shows the best similarity/dissimilarity values obtained by the hexagonal model and by the other works [11,19]. A smaller value in an entry indicates that the method is more accurate than the others when comparing species with respect to p.chimpa. Any species could be taken as the reference in place of p.chimpa. For all the entries, the hexagonal model shows the smallest values, which indicates higher accuracy in terms of the degree of similarity/dissimilarity than the others. The entries for each species paired with p.chimpa in our method are roughly 3 to 36 times lower than those of the other methods. The reason for this better performance is that we take into account the positional probability of every codon's translation into an amino acid when drawing the 3D DNA curve and when creating feature vectors. In conclusion, our method preserves more information than existing methods and produces a clearer distinction of similarity/dissimilarity values.
Table 5
Similarity degree with p.chimpa compared with [11,19].
Methods/species          Baboon   Gibbon   Orangutan   Gorilla   c.chimpa   Horse    w.rhinoce
Our method               0.0008   0.0013   0.0019      0.0002    0.0001     0.0038   0.0065
Bo Liao et al. [11]      0.0028   0.0117   0.0126      0.0022    0.0014     0.0165   0.0197
Ronghui Wu et al. [19]   0.0062   0.0194   0.0242      0.0044    0.0036     0.0287   0.0399
5. Conclusion
We have proposed a codon-based 3D graphical representation scheme for DNA sequences and experimented with several datasets to assess its effectiveness. In our coding scheme, the 64 codons are plotted on 3D Cartesian coordinates based on the chemical properties and positional information of the nucleotides. In graphical visualization, our coding scheme readily provides information on the evolutionary relationship between different species by
showing the curve shape of DNA sequences. An extensive performance study shows that our approach provides higher accuracy than existing approaches in terms of the degree of similarity.
Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government, Ministry of Science, ICT & Future Planning (MSIP) (No. 2010-0028631). It was also supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2013R1A1A2006236).
References
[1] A.T.M. Golam Bari, Mst. Rokeya Reaz, A.K.M. Tauhidul Islam, Ho-Jin Choi, Byeong-Soo Jeong, Effective DNA Encoding for Splice Site Prediction using SVM, DASFAA 2013, Big Data Management Analytics, Wuhan China, 2013 (LNCS 7827), pp. 46–58.
[2] A.T.M. Golam Bari, Mst. Rokeya Reaz, Ho-Jin Choi, Byeong-Soo Jeong, Evolut. Bioinf. 9 (2013) 251–261.
[3] C.T. Zhang, R. Zhang, H.Y. Ou, Bioinformatics 19 (2003) 593–599.
[4] J. Ewens, G. Grant, Statistical Methods in Bioinformatics: An Introduction, Springer Science, New York, 2005.
[5] Xiaofeng Guo, M. Randic, Subhash Basak, Chem. Phys. Lett. 350 (2001) 106–112.
[6] E. Hamori, J. Ruskin, J. Biol. Chem. 258 (1983) 1318–1327.
[7] Yujuan Huang, Tianming Wang, Int. J. Quantum Chem. 112 (2012) 1746–1757.
[8] Nafiseh Jafarzadeh, Ali Iranmanesh, MATCH Commun. Math. Comput. Chem. 68 (2012) 611–620.
[9] Nafiseh Jafarzadeh, Ali Iranmanesh, Math. Biosci. 241 (2013) 217–224.
[10] Y. Li, G. Huang, B. Liao, Z. Liu, MATCH Commun. Math. Comput. Chem. 61 (2009) 519–532.
[11] Bo Liao, Renfa Li, Wen Zhu, Xuyu Xiang, J. Math. Chem. 42 (2007) 47–57.
[12] Bo Liao, Mingshu Tan, Kequan Ding, Chem. Phys. Lett. 402 (2005) 380–383.
[13] Bo Liao, Tian-ming Wang, J. Chem. Inf. Model. 44 (2004) 1666–1670.
[14] Bo Liao, W. Zhu, Y. Liu, MATCH Commun. Math. Comput. Chem. 56 (2006) 209–216.
[15] M. Randic, M. Vracko, N. Lers, D. Plavsic, Chem. Phys. Lett. 368 (2003) 1–6.
[16] Milan Randić, Marjan Vračko, Nella Lerš, Dejan Plavšić, Chem. Phys. Lett. 371 (2003) 202–207.
[17] Xingqin Qi, Qin Wu, Yusen Zhang, Eddie Fuller, Cun-Quan Zhang, Evolut. Bioinf. 7 (2011) 149–158.
[18] Xianwen Ren, Yong-Cui Wang, Yong Wang, Xiang-Sun Zhang, Nai-Yang Deng, BMC Bioinf. 12 (2011) 409.
[19] Ronghui Wu, Qiguang Hu, Renfa Li, Guangxue Yue, MATCH Commun. Math. Comput. Chem. 67 (2012) 269–276.