Chemical Physics Letters 590 (2013) 192–195
Similarity analysis of protein sequences based on 2D and 3D amino acid adjacency matrices

Ali El-Lakkani, Seham El-Sherif ⇑

Department of Biophysics, Faculty of Science, Cairo University, Giza, Egypt
Article history: Received 17 June 2013 In final form 10 October 2013 Available online 19 October 2013
Abstract

This work presents a 3D amino acid adjacency matrix based on the 2D amino acid adjacency matrix proposed by Randić et al. (2008) [1]. Furthermore, a novel numerical method is proposed to measure the degree of similarity based on 2D and 3D adjacency matrices. The new method is applied to the ND5 proteins of nine species. To demonstrate the efficiency of the presented work, a correlation with ClustalW and a significance analysis are provided. The results show that our approach is the most significant among the related works compared.

Crown Copyright © 2013 Published by Elsevier B.V. All rights reserved.
⇑ Corresponding author. E-mail address: [email protected] (S. El-Sherif).

1. Introduction

A protein is composed of a linear array of amino acids joined together by covalent peptide bonds. The amino acid sequence that makes up a protein is called the primary structure. The three-dimensional, functional structure (conformation) of the protein depends on this amino acid sequence. Advances in sequencing techniques have led to an explosive growth in the number of known protein sequences in various databases. Protein sequences are stored in computer database systems as long character strings in which each amino acid is represented by one character. It is difficult to extract any features by directly reading these sequences. Therefore, many methods have been proposed to analyze protein sequences.

Protein sequence comparison identifies the similarities and differences between protein sequences. It helps to identify the structure and function of newly identified proteins, since similar sequences are expected to have similar structures, and to infer relationships between proteins that last shared a common ancestor billions of years ago. Many similarity measures support sequence comparison. They can be divided into two classes: alignment-based and alignment-free measures. Alignment-based measures use dynamic programming, which generates a matrix whose elements represent all possible alignments between two sequences. The highest set of sequential scores in the matrix defines an optimal alignment. However, the search for optimal solutions encounters difficulties in (i) the computational load for large databases and (ii) the choice of scoring scheme [2]. Therefore, alignment-free approaches [3] have been developed to overcome the limitations of alignment-based methods. Most of these approaches depend on graphical representation and numerical characterization of protein sequences.

Graphical representations of protein sequences have been used to compare protein sequences; they provide a simple way of viewing, sorting and comparing various sequences, and they provide mathematical descriptors that help in recognizing major differences among similar protein sequences quantitatively [4–8]. The graphical representation approach put forward by Hamori and Ruskin (1983) [9] for DNA sequences is a key approach for analyzing and visualizing biological sequences, and it has drawn the attention of many researchers. Graphical representations may be 2D [10–20] or 3D [21–26] in the Cartesian coordinate system. The basic philosophy of defining mathematical descriptors of sequences is to give biologists a tool for characterizing sequences in order to derive some kind of relative ranking of the sequences for mutational or evolutionary studies, or for the prediction of functional properties [27].

Numerical characterization allows a quantitative comparison between protein sequences, so it is a powerful method for similarity/dissimilarity analysis. Protein sequences can be numerically characterized by a mathematical object such as a matrix. A matrix may depend on a graph, e.g. the Euclidean distance matrix (ED), graph theoretical distance matrix (GD), path distance matrix (PD), L/L matrix and M/M matrix, or may not depend on a graph, such as the amino acid adjacency matrix [1].

In this work, a 3D amino acid adjacency matrix is introduced based on the 2D amino acid adjacency matrix proposed by Randić et al. [1]. This 3D adjacency matrix gives more information about the distribution of adjacent amino acids than the 2D adjacency matrix. Furthermore, a novel numerical method to measure the degree of similarity based on 2D and 3D amino acid adjacency matrices is introduced. This method is applied to the ND5 proteins of nine different species.
A correlation with ClustalW and significance analyses are performed and compared with other related approaches to demonstrate the efficiency of this method. Finally, the results for the 3D amino acid adjacency matrix are better than the results for the 2D amino acid adjacency matrix.

0009-2614/$ - see front matter. Crown Copyright © 2013 Published by Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.cplett.2013.10.032

2. Materials and methods

2.1. Data set

This approach is applied to the NADH dehydrogenase subunit 5 (ND5) proteins of nine different species, which are listed in Table 1. This data set has been used previously in other approaches [8,14,16,17,22,23,25,26,28,29].

2.2. 2D amino acid adjacency matrix

Numerical characterization of a protein primary sequence aims to capture the essence of the composition and distribution of adjacent amino acids within the sequence. The distribution of adjacent amino acids, however, is more informative than the amino acid composition alone [29]. The 2D amino acid adjacency matrix [1] is a 20 × 20 nonsymmetrical matrix that records the adjacency of amino acids in a protein sequence. The matrix element $e_{i,j}$ represents the number of times amino acid (i) is followed by amino acid (j) in the protein sequence when it is read from left to right. The rows and the columns of the matrix are labeled alphabetically based on the three-letter codes of the amino acids. The entries in the rows indicate the adjacencies when the protein sequence is read from left to right, and the entries in the columns indicate the adjacencies when the protein sequence is read from right to left. With this definition, the row sums of the matrix give the amino acid abundance, except that the last amino acid of the sequence, which has no successor, is not counted. Similarly, the column sums of the matrix give the amino acid abundance, except that the first amino acid of the sequence, which has no predecessor, is not counted.

2.3. 3D amino acid adjacency matrix

The 3D adjacency matrix is a 20 × 20 × 20 matrix. Each dimension is labeled alphabetically based on the three-letter codes of the amino acids. The matrix element $e_{i,j,k}$ represents the number of times amino acid (i) is followed by amino acid (j), which is in turn followed by amino acid (k), in the protein sequence when it is read from left to right.
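The two matrix definitions can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' program: the function names `adjacency_2d` and `adjacency_3d`, the use of one-letter codes as input, and the toy sequence are all hypothetical. The only detail taken from the text is that indices follow the alphabetical order of the three-letter codes.

```python
# One-letter amino acid codes, ordered alphabetically by three-letter code
# (Ala, Arg, Asn, Asp, Cys, Gln, Glu, Gly, His, Ile, Leu, Lys, Met, Phe,
#  Pro, Ser, Thr, Trp, Tyr, Val), matching the labelling described in the text.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def adjacency_2d(seq):
    """20 x 20 matrix: e[i][j] counts residue i immediately followed by j."""
    e = [[0] * 20 for _ in range(20)]
    for a, b in zip(seq, seq[1:]):
        e[INDEX[a]][INDEX[b]] += 1
    return e


def adjacency_3d(seq):
    """20 x 20 x 20 matrix: e[i][j][k] counts i followed by j followed by k."""
    e = [[[0] * 20 for _ in range(20)] for _ in range(20)]
    for a, b, c in zip(seq, seq[1:], seq[2:]):
        e[INDEX[a]][INDEX[b]][INDEX[c]] += 1
    return e


# Hypothetical toy sequence (not one of the ND5 proteins).
toy = "MAMAL"
e2 = adjacency_2d(toy)
e3 = adjacency_3d(toy)
M, A, L = INDEX["M"], INDEX["A"], INDEX["L"]
print(e2[M][A], e2[A][M], e2[A][L])           # counts of the pairs MA, AM, AL
print(e3[M][A][M], e3[A][M][A], e3[M][A][L])  # counts of the triples MAM, AMA, MAL
```

A sequence of length N contributes N − 1 entries to the 2D matrix and N − 2 entries to the 3D matrix. The sketch raises a `KeyError` for non-standard residue codes; how such residues should be handled is not discussed in the text.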
The 3D matrix gives information about the first neighbor and the second neighbor of each amino acid. It consists of twenty 2D matrices, each a 20 × 20 square matrix. The sum of the elements of each of these 2D matrices gives the abundance of one amino acid; for the first and the second amino acids of the protein sequence, however, the sum of the elements of the corresponding 2D matrix plus one gives the abundance of the corresponding amino acid.

Table 1
Information on the ND5 proteins of nine species.

Species                              ID (NCBI)    Length
Human (Homo sapiens)                 AP_000649    603
Gorilla (Gorilla gorilla)            NP_008222    603
Pigmy chimpanzee (Pan paniscus)      NP_008209    603
Common chimpanzee (Pan troglodytes)  NP_008196    603
Fin whale (Balaenoptera physalus)    NP_006899    606
Blue whale (Balaenoptera musculus)   NP_007066    606
Rat (Rattus norvegicus)              AP_004902    610
Mouse (Mus musculus)                 NP_904338    607
Opossum (Didelphis virginiana)       NP_007105    602

2.4. A novel numerical method for similarity/dissimilarity analysis

In most approaches, the degree of similarity of protein sequences is measured by calculating the Euclidean distance or the correlation angle between protein descriptors. In this work, a novel numerical method is proposed to measure the degree of similarity based on the percentage of common distribution properties using 2D and 3D adjacency matrices.

In the case of the 2D adjacency matrix, the element $e_{i,j}$ represents the number of times amino acid (i) is followed by amino acid (j) in the protein sequence when it is read from left to right. If the corresponding elements $e_{i,j}$ and $e'_{i,j}$ in the adjacency matrices of two protein sequences are equal, the two sequences share $e_{i,j}$ occurrences of the same distribution property. If $e_{i,j}$ for one sequence is greater than the corresponding $e'_{i,j}$ for the other sequence, the two sequences share $e'_{i,j}$ occurrences of that property. In both cases the common number is the smaller of the two elements, $\min(e_{i,j}, e'_{i,j})$. Summing the common numbers over all elements of the two adjacency matrices and dividing by the number of amino acids in the shorter sequence gives the percentage of common distribution properties, which is used to measure the similarity between the two sequences.

In the 3D adjacency matrix, the element $e_{i,j,k}$ represents the number of times amino acid (i) is followed by amino acid (j) followed by amino acid (k) in the protein sequence when it is read from left to right. Following the same counting method as for the 2D adjacency matrices, we can determine the percentage of common distribution properties of two protein sequences from their 3D adjacency matrices, which is likewise used to measure the similarity between the two sequences.
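The counting rule of Section 2.4 amounts to summing the element-wise minimum of the two adjacency matrices and dividing by the length of the shorter sequence. The sketch below is a minimal illustration, assuming a sparse representation in which each matrix is stored as a `collections.Counter` of adjacent residue runs (k = 2 for the 2D matrix, k = 3 for the 3D matrix). The names `kmer_counts` and `common_percentage` and the toy sequences are hypothetical and not taken from the paper.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Sparse adjacency matrix: counts of every run of k consecutive residues."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def common_percentage(seq_a, seq_b, k=2):
    """Percentage of common distribution properties (k=2: 2D, k=3: 3D)."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Counter intersection keeps min(e, e') for every element present in both.
    common = sum((counts_a & counts_b).values())
    return 100.0 * common / min(len(seq_a), len(seq_b))


# Hypothetical toy sequences (not ND5 proteins).
print(common_percentage("MAMAL", "MAMAV", k=2))  # shared pairs: MA, MA, AM
print(common_percentage("MAMAL", "MAMAV", k=3))  # shared triples: MAM, AMA
```

Because a sequence of N residues has only N − 1 adjacent pairs and N − 2 adjacent triples, dividing by the number of amino acids means that even two identical sequences score slightly below 100% under this definition.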
3. Discussion and results

The percentage of common distribution properties method for measuring the degree of similarity is applied to the ND5 proteins of the nine species listed in Table 1. The similarity matrices in Tables 2 and 3 are obtained by calculating the percentage of common distribution properties for each pair of protein sequences using their 2D and 3D adjacency matrices, respectively. The larger the percentage of common properties of two proteins, the more similar they are. It can be observed from Tables 2 and 3 that the ND5 proteins of human, gorilla, pigmy chimpanzee and common chimpanzee are more similar to each other, and that the proteins of the two pairs (fin whale, blue whale) and (mouse, rat) are very similar to each other (highlighted in bold in the published tables); on the other hand, the protein of opossum is quite dissimilar to those of the other species. Also, the entries of (human, pigmy chimpanzee) and (human, common chimpanzee) are larger than the entry of (human, gorilla), which is consistent with the known evolutionary relationships among these species.

To demonstrate the efficiency of this approach, we compared our results in Tables 2 and 3 with the ClustalW results. The Clustal series of programs is widely used in molecular biology for the multiple alignment of both nucleic acid and protein sequences and for preparing phylogenetic trees. The popularity of the programs depends on a number of factors, including not only the accuracy of the results but also the robustness, portability and user-friendliness of the programs [30,31]. The distance matrix for the nine ND5 proteins based on ClustalW is listed in Table 4 (Table 5 in Ref. [26]). The correlation coefficients between the results of our approach and the results of ClustalW are calculated for the nine species and listed in the first two columns of Table 5. The correlation coefficients between the Euclidean distances (ED) calculated from the full 2D adjacency matrix and the results of ClustalW are listed in the fourth column of Table 5.
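As an illustration of how one entry of Table 5 can be obtained, the sketch below computes a Pearson correlation coefficient for the human row. The text does not spell out which entries enter each coefficient or how the sign is treated, so two assumptions are made here: the eight off-diagonal entries of a species' row are used, and the absolute value is reported, since a similarity and a distance are negatively correlated. The function name `pearson` is hypothetical; the numbers are the human rows of Tables 3 and 4.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Human row of Table 3 (3D similarity, %) and of Table 4 (ClustalW distance),
# in the order gorilla, P. chimp., C. chimp., F. whale, B. whale, rat, mouse, opossum.
similarity_3d = [76.6169, 83.0846, 83.0846, 49.2537,
                 49.5854, 45.1078, 44.6103, 41.5282]
clustalw = [10.7, 7.1, 6.9, 41.0, 41.3, 50.2, 48.9, 50.4]

r = pearson(similarity_3d, clustalw)
print(f"r = {r:.4f}, |r| = {abs(r):.4f}")
```

Under these assumptions the magnitude comes out close to the value 0.9961 listed for human in the first column of Table 5, with a negative sign because higher similarity corresponds to smaller distance.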
To compare our approach with the approaches in Refs. [1,8,14,23,25,26,29], the correlation coefficients between the results of these approaches and the results of ClustalW for the nine species are calculated and provided in the remaining columns of Table 5. The approach in Ref. [1] is applied to ND5 proteins instead of ND6, and the correlation coefficients between its results and the ClustalW results are listed in the fifth column of Table 5. From Table 5 we can see that the results of our approach possess higher correlation coefficients with the results of ClustalW than the other approaches. The results in Table 3 of our approach, based on the 3D adjacency matrix, possess higher correlation coefficients with the results of ClustalW than the results in Table 2, based on the 2D adjacency matrix.

Because we have a small data set (n = 9), which can produce high correlations, the significance of the correlations has been calculated to check whether the correlation of two sets of data is sufficiently strong or likely occurred by chance [25]. We checked the statistical significance of the correlation coefficients that are greater than 0.7. Our sample size equals 9, thus the number of degrees of freedom is 7. The t-values of the r-values greater than 0.7 are listed in Table 6. The t-values of our results indicate a probability of less than 0.00002 of having occurred by coincidence.

Table 2
Similarity matrix based on the percentage of common distribution properties for each pair of proteins using their 2D adjacency matrices.

Species    Gorilla    P. chimp.  C. chimp.  F. whale   B. whale   Rat        Mouse      Opossum
Human      88.7231    90.5473    91.3765    79.1045    78.9386    73.4660    74.2952    71.9269
Gorilla               88.3914    88.7231    79.1045    78.2753    75.2902    74.7927    72.0930
P. chimp.                        93.3665    80.4312    80.4318    74.7927    75.7877    72.5914
C. chimp.                                   79.1045    78.9386    73.9635    75.2902    72.2591
F. whale                                               94.7195    74.2574    75.2475    72.0930
B. whale                                                          73.9274    75.0825    71.7608
Rat                                                                          81.3839    74.0864
Mouse                                                                                   76.7442

Table 3
Similarity matrix based on the percentage of common distribution properties for each pair of proteins using their 3D adjacency matrices.

Species    Gorilla    P. chimp.  C. chimp.  F. whale   B. whale   Rat        Mouse      Opossum
Human      76.6169    83.0846    83.0846    49.2537    49.5854    45.1078    44.6103    41.5282
Gorilla               77.2803    77.2803    49.5854    48.7562    44.4444    46.6003    40.5316
P. chimp.                        86.4013    49.9171    49.9171    43.9469    45.1078    42.1927
C. chimp.                                   51.4096    51.7413    44.1128    45.2736    41.8605
F. whale                                               90.7591    45.3795    46.3696    41.8605
B. whale                                                          44.8845    46.8647    42.0266
Rat                                                                          60.2965    43.3555
Mouse                                                                                   44.6844

Table 4
The distances for the ND5 protein sequences of nine species based on ClustalW.

Species    Gorilla    P. chimp.  C. chimp.  F. whale   B. whale   Rat        Mouse      Opossum
Human      10.7       7.1        6.9        41.0       41.3       50.2       48.9       50.4
Gorilla               9.7        9.9        42.7       42.4       51.4       49.9       54.0
P. chimp.                        5.1        40.1       40.1       50.2       48.9       50.1
C. chimp.                                   40.4       40.4       50.8       49.6       51.4
F. whale                                               3.5        45.3       46.8       52.7
B. whale                                                          45.0       45.9       52.7
Rat                                                                          25.9       54.0
Mouse                                                                                   50.8

Table 5
Correlation coefficients with ClustalW for the nine ND5 proteins, for our approach and the approaches in Refs. [1,8,14,17,23,25,26,29]. Each column gives the correlation between the ClustalW results and the results named in the column key.

Column key:
(1) ClustalW & Table 3 in our approach
(2) ClustalW & Table 2 in our approach
(3) ClustalW & Table 4 in Ref. [29]
(4) ClustalW & ED of full 2D adjacency matrix for ND5
(5) ClustalW & Table 4 in Ref. [1] but for ND5
(6) ClustalW & Table 4 in Ref. [26]
(7) ClustalW & Table 6 in Ref. [8]
(8) ClustalW & Table 3 in Ref. [8]
(9) ClustalW & Table 4 in Ref. [17]
(10) ClustalW & Table 3 in Ref. [23]
(11) ClustalW & Table 4 in Ref. [17]
(12) ClustalW & Table 3 in Ref. [25]
(13) ClustalW & Table 4 in Ref. [25]
(14) ClustalW & Table 3 in Ref. [14]

Species    (1)       (2)       (3)       (4)       (5)       (6)       (7)
Human      0.9961    0.9894    0.9969    0.9769    0.7378    0.9729    0.9748
Gorilla    0.9973    0.9908    0.9927    0.9869    0.2424    0.9763    0.9421
P. chimp.  0.9953    0.9796    0.9851    0.9722    0.7636    0.9819    0.9739
C. chimp.  0.9967    0.9914    0.9873    0.9859    0.7728    0.9756    0.9729
F. whale   0.9940    0.9817    0.9956    0.9921    0.6299    0.9485    0.9506
B. whale   0.9937    0.9810    0.9936    0.9892    0.6518    0.9450    0.9662
Rat        0.9653    0.9045    0.8912    0.9311    0.0148    0.8709    0.4418
Mouse      0.9878    0.9062    0.8657    0.8213    0.5672    0.7296    0.6333
Opossum    0.2257    0.1455    0.4798    0.1206    0.2892    0.5447    0.3029

Species    (8)       (9)       (10)      (11)      (12)      (13)      (14)
Human      0.8849    0.9236    0.9143    0.4566    0.9306    0.7177    0.9059
Gorilla    0.7398    0.9316    0.6969    0.7849    0.9293    0.7748    0.8800
P. chimp.  0.8889    0.9542    0.9222    0.7861    0.8403    0.7661    0.6823
C. chimp.  0.8920    0.9607    0.9257    0.7676    0.9344    0.7845    0.8819
F. whale   0.6839    0.7388    0.6026    0.2839    0.3508    0.5318    0.3287
B. whale   0.7296    0.8147    0.6981    0.0731    0.6486    0.5512    0.3381
Rat        0.8084    0.5882    0.7167    0.3693    0.4453    0.8376    0.6696
Mouse      0.7612    0.5221    0.6711    0.4881    0.4192    0.4559    0.5914
Opossum    0.4344    0.2992    0.4746    0.2044    0.2975    0.4326    0.1342
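The t-values used in the significance check of Section 3 are consistent with the standard test statistic for a correlation coefficient, t = |r| · sqrt(df) / sqrt(1 − r²), evaluated with the 7 degrees of freedom stated in the text. The sketch below assumes that formula, which the paper does not quote explicitly; the function name `t_value` is hypothetical.

```python
import math


def t_value(r, df=7):
    """t statistic for a correlation coefficient r with df degrees of freedom."""
    return abs(r) * math.sqrt(df) / math.sqrt(1.0 - r * r)


# Human entries of Table 5 for our 3D (0.9961) and 2D (0.9894) results.
for r in (0.9961, 0.9894):
    print(f"r = {r:.4f}  t = {t_value(r):.4f}")
```

For these two inputs the statistic comes out close to the values 29.8695 and 18.0263 reported for human in Table 6. Converting a t-value into a probability requires the Student t distribution, which is left out of this sketch.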
Table 6
The t-values calculated for the correlation coefficients with |r| > 0.7, on which the significance is based. A dash marks a correlation coefficient that does not exceed 0.7.

Column key:
(1) ClustalW & Table 3 in our approach
(2) ClustalW & Table 2 in our approach
(3) ClustalW & Table 4 in Ref. [29]
(4) ClustalW & ED of full 2D adjacency matrix for ND5
(5) ClustalW & Table 4 in Ref. [1] but for ND5
(6) ClustalW & Table 4 in Ref. [26]
(7) ClustalW & Table 6 in Ref. [8]
(8) ClustalW & Table 3 in Ref. [8]
(9) ClustalW & Table 4 in Ref. [17]
(10) ClustalW & Table 3 in Ref. [19]
(11) ClustalW & Table 4 in Ref. [17]
(12) ClustalW & Table 3 in Ref. [25]
(13) ClustalW & Table 4 in Ref. [25]
(14) ClustalW & Table 3 in Ref. [14]

Species    (1)       (2)       (3)       (4)       (5)       (6)       (7)
Human      29.8695   18.0263   33.5229   12.0949   2.8918    11.1322   11.5612
Gorilla    35.9312   19.3699   21.7763   16.1845   –         11.9352   7.4331
P. chimp.  27.1926   12.8971   15.1546   10.9852   3.1289    13.7163   11.3522
C. chimp.  32.4863   20.0433   16.4424   15.5881   3.2217    11.7564   11.1322
F. whale   24.0435   13.6390   28.1106   20.9235   –         7.9212    8.1021
B. whale   23.4587   13.3782   23.2730   17.8559   –         8.0408    9.9162
Rat        9.7798    5.6113    5.1981    6.7536    –         4.6884    –
Mouse      16.7823   5.6700    4.5757    3.8089    –         2.8226    –
Opossum    –         –         –         –         –         –         –

Species    (8)       (9)       (10)      (11)      (12)      (13)      (14)
Human      5.0264    6.3742    5.9723    –         6.7264    2.7269    5.6596
Gorilla    2.9091    6.7810    –         3.3515    6.6572    3.2425    4.9019
P. chimp.  5.1338    8.4386    6.3093    3.3649    4.1009    3.1536    –
C. chimp.  5.2208    9.1566    6.4749    3.1686    6.9399    3.3470    4.9493
F. whale   –         2.9004    –         –         –         –         –
B. whale   2.8226    3.7171    –         –         –         –         –
Rat        3.6335    –         2.7190    –         –         4.0566    –
Mouse      2.8226    –         –         –         –         –         –
Opossum    –         –         –         –         –         –         –

4. Conclusion

A novel numerical method to measure the degree of similarity of protein sequences is proposed. The similarity matrix is obtained by calculating the percentage of common distribution properties for each pair of proteins using their 2D and 3D adjacency matrices. The correlation coefficients between our results and ClustalW are the highest among the related approaches compared. The results based on the 3D adjacency matrix have higher correlation coefficients with the ClustalW results than those based on the 2D adjacency matrix. Our approach is simple, efficient and the most significant. The calculations were done in a few seconds using a simple computer program written in QuickBASIC.

References

[1] M. Randić, M. Novič, M. Vračko, SAR QSAR Environ. Res. 19 (2008) 339.
[2] Q. Dai, X.Q. Liu, Y.H. Yao, F. Zhao, J. Theor. Biol. 276 (2011) 174.
[3] S. Vinga, J. Almeida, Bioinformatics 19 (2003) 513.
[4] M. Randić, D. Butina, J. Zupan, Chem. Phys. Lett. 419 (2006) 528.
[5] F. Bai, T.M. Wang, J. Biomol. Struct. Dyn. 23 (2006) 537.
[6] Q. Dai, X.Q. Liu, T.M. Wang, J. Theor. Biol. 247 (2007) 103.
[7] M. Randić, J. Zupan, D.V. Topic, J. Mol. Graph. Model. 26 (2007) 290.
[8] Y.H. Yao, Q. Dai, C. Li, P.A. He, X.Y. Nan, Y.Z. Zhang, Proteins 73 (2008) 864.
[9] E. Hamori, J. Ruskin, J. Biol. Chem. 258 (1983) 1318.
[10] F. Kong, X.Y. Nan, P.A. He, Q. Dai, Y.H. Yao, IEEE 6th International Conference on Systems Biology (ISB) (2012) 321.
[11] Z.H. Qi, J. Feng, X.Q. Qi, L. Li, Comput. Biol. Med. 42 (2012) 556.
[12] J.F. Yu, X. Sun, J.H. Wang, Int. J. Quantum Chem. 111 (2011) 2835.
[13] B. Liao, B. Liao, X. Lu, Z. Cao, J. Comput. Chem. 32 (2011) 2539.
[14] Z.C. Wu, X. Xiao, K.C. Chou, J. Theor. Biol. 267 (2010) 29.
[15] M. Randić, Chem. Phys. Lett. 440 (2007) 291.
[16] P.A. He, Y.P. Zhang, Y.H. Yao, Y.F. Tang, X.Y. Nan, J. Comput. Chem. 31 (2010) 2136.
[17] J. Wen, Y.Y. Zhang, Chem. Phys. Lett. 476 (2009) 281.
[18] M.A. Gates, J. Theor. Biol. 119 (1986) 319.
[19] A. Nandy, Curr. Sci. 66 (1994) 309.
[20] P.M. Leong, S. Morgenthaler, Comput. Appl. Biosci. 11 (1995) 503.
[21] M. Randić, M. Vračko, A. Nandy, S.C. Basak, J. Chem. Inf. Comput. Sci. 40 (2000) 1235.
[22] P.A. He, X.F. Li, J.L. Yang, J. Wang, MATCH Commun. Math. Comput. Chem. 65 (2011) 445.
[23] P.A. He, D. Li, Y.P. Zhang, X. Wang, Y.H. Yao, J. Theor. Biol. 304 (2012) 81.
[24] C. Li, X.Q. Yu, L. Yang, X. Zheng, Z. Wang, Phys. A 388 (2009) 1967.
[25] M.I. Abo el Maaty, M.M. Abo-Elkhier, M.A. Abd Elwahaab, Phys. A 389 (2010) 4668.
[26] P.A. He, J. Wei, Y.H. Yao, Z. Tie, Phys. A 391 (2012) 93.
[27] A. Nandy, M. Harle, S.C. Basak, ARKIVOC 9 (2006) 211.
[28] H.J. Yu, D.S. Huang, Chem. Phys. Lett. 531 (2012) 261.
[29] H.J. Yu, D.S. Huang, IEEE Trans. Comput. Biol. Bioinform. 10 (2013) 457.
[30] J.D. Thompson, D.G. Higgins, T.J. Gibson, Nucleic Acids Res. 22 (1994) 4673.
[31] R. Chenna, H. Sugawara, T. Koike, R. Lopez, T.J. Gibson, D.G. Higgins, J.D. Thompson, Nucleic Acids Res. 31 (2003) 3497.