ALIGN_MTX—An optimal pairwise textual sequence alignment program, adapted for using in sequence-structure alignment


Computational Biology and Chemistry 33 (2009) 235–238


Computational Biology and Chemistry journal homepage: www.elsevier.com/locate/compbiolchem

Brief Communication

Boris Vishnepolsky ∗, Malak Pirtskhalava

Institute of Molecular Biology and Biological Physics, 12 Gotua St., Tbilisi, 0160, Georgia

Article info

Article history: Received 28 October 2008; received in revised form 26 March 2009; accepted 23 April 2009

Keywords: Sequence alignment; Substitution matrix; Threading; Text analysis

Abstract

The program ALIGN_MTX aligns two textual sequences whose elements may each be designated by several characters, using arbitrary user-supplied substitution matrices. It can therefore be used not only to align amino acid and nucleotide sequences but also for the sequence-structure alignment used in threading, for amino acid sequence alignment with a previously computed position-specific score matrix (PSSM), and in other cases where biological or non-biological textual sequences must be aligned. This distinguishes it from most similar alignment programs, which as a rule align only amino acid or nucleotide sequences represented as strings of single alphabetic characters. ALIGN_MTX is available for free use as a downloadable zip archive at http://www.imbbp.org/software/ALIGN_MTX/. As an application of the program, we compare the alignment quality obtained with different types of substitution matrix on sets of distantly related protein pairs. Three kinds of matrix were tested for alignment accuracy: the threading matrix SORDIS, which is based on side-chain orientation relative to hydrophobic core centers; the evolutionary change-based substitution matrix BLOSUM; and position-specific score matrices (PSSM), which use multiple sequence alignment information. PSSM performs best overall, but on the reduced set with lower sequence similarity the threading matrix SORDIS performs equally well, and the results indicate that a potential combining SORDIS and PSSM can improve alignment quality for evolutionarily distantly related protein pairs.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

Despite long-term investigation, the problem of predicting the 3D structure of a protein from its amino acid sequence is still far from solved. There are several main approaches to the fold recognition problem. One involves direct sequence-sequence comparison (Henikoff and Henikoff, 2000; Tan et al., 2006), carried out mainly with evolutionary change-based substitution matrices (ECBSM) (Dayhoff et al., 1978; Gonnet et al., 1992; Henikoff and Henikoff, 1992, 2000). Another is sequence-structure threading, in which the compatibility of a sequence with each known structure is assessed by a structure-derived score function (or structural profile) (Bowie et al., 1991; Bryant and Lawrence, 1993; Godzik et al., 1992; Jones et al., 1992; Kocher et al., 1994; Ouzounis et al., 1993; Rost, 1995; Sippl and Weitckus, 1992; Skolnick and Kihara, 2001). A further promising way to improve the quality of fold recognition is to employ multiple sequence alignments, which provide information about structural and functional relationships within a group of related proteins, such as evolutionarily conserved regions or conserved hydrophobicity patterns. This information can be represented in a variety of forms, such as motifs, profiles (Gribskov et al., 1987), position-specific score matrices (PSSM) (Altschul et al., 1997) and HMMs (Sjolander et al., 1996).

In all cases, the key to accurate prediction is the quality of the alignment between the query sequence and the template protein structure. Alignment accuracy depends on the substitution matrix used, and for different pairs the best results may be obtained with different matrices. It is therefore important to compare substitution matrices with respect to alignment quality. The matrices used in the approaches above are of different types (they differ in size, in the characters used to label their elements, etc.), but existing alignment programs accept only a fixed type of matrix and cannot be used with matrices of other types. This paper presents a program that performs pairwise alignment with different types of substitution matrices, such as ECBSM, threading matrices, PSSM and others, which may differ in size and in the characters used. It also reports a comparison of the alignment quality obtained with different substitution matrices for distantly related proteins.

∗ Corresponding author. Tel.: +995 32 374597; fax: +995 32 371733.
E-mail addresses: [email protected] (B. Vishnepolsky), [email protected] (M. Pirtskhalava).
1476-9271/$ – see front matter © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.compbiolchem.2009.04.003


2. Methods

Table 1. Set of structurally similar proteins used for testing alignment accuracy.

Query      Template    Length of        Length of            Sequence
sequence   structure   query sequence   template structure   similarity (%)
193l       153l        129              185                  13
1aba       1gp1a       87               184                  4
1agja      1elg        242              240                  15
1ak1       2dri        308              271                  8
1ash       1bvd        147              153                  15
1ash       1cpca       147              162                  7
1ax4a      1cl1a       465              391                  11
1bina      1bvd        143              153                  13
1bina      2hbg        143              147                  15
1btn       1irsa       106              112                  15
1btn       1mai        106              119                  12
1cewi      1ouna       108              125                  10
1cpca      1cola       162              197                  15
1cpca      2hbg        162              147                  6
1dat       1afrf       174              345                  9
1dat       1xika       174              340                  7
1dvra      1dts        220              220                  13
1dynb      1irsa       113              112                  13
1dynb      1mai        113              119                  13
1ecpa      1ula        237              289                  15
1fuia      1bhs        591              284                  15
1gsa       1bnca       314              433                  11
1gsa       1iow        314              306                  11
1gtpa      1gtqa       221              138                  14
1hce       4fgf        118              124                  14
1hfc       1iag        157              201                  15
1i1b       4fgf        151              124                  14
1irsa      1mai        112              119                  14
1lpe       1nbbb       144              129                  10
1lpe       1vlta       144              142                  10
1ltid      1bcpl       103              98                   15
1ltid      1prtb       103              196                  15
1ltid      1tiid       103              98                   13
1ouna      1std        125              162                  11
1pgs       1phm        311              305                  4
1plq       2pola       258              366                  14
1ryt       1xika       190              340                  14
1ryt       1xsm        190              288                  12
1sbp       2abh        309              321                  12
1spbp      1nuea       71               151                  10
1ste       3ulla       238              106                  7
1tdj       1psda       494              404                  6
1tiid      3ulla       98               106                  6
1ulo       2ayh        152              214                  13
1urna      1spbp       96               71                   8
1vin       1ad6        252              185                  10
1vlta      1nbbb       142              129                  12
1wba       1i1b        171              151                  11
1wba       4fgf        171              124                  6
1xika      1afrf       340              345                  14
1xsm       1afrf       288              345                  11
2alp       1hava       198              216                  10
2dri       1rnl        271              200                  8
2fb4h      1fna        229              91                   11
2fb4h      1tupa       229              196                  4
3nll       1qrdb       138              273                  15
3ulla      1bcpl       106              98                   15
6ldh       3nll        329              138                  10

2.1. Difference of ALIGN_MTX from Other Alignment Programs

There are many programs that align two textual sequences, for example DPALIGN (http://www.ira.cinvestav.mx:8080/bioperl/Bio/Tools/dpAlign.html), DOTLET (http://myhits.isb-sib.ch/cgi-bin/dotlet), ALION (http://motif.stanford.edu/distributions/alion/), LALIGN (http://www.ch.embnet.org/software/LALIGN_form.html), ALIGN (http://www.ebi.ac.uk/Tools/emboss/align/) and SIM (http://www.expasy.org/tools/sim.html). However, most of them are intended for biological sequences such as DNA, RNA or protein sequences. In these programs an amino acid or nucleotide sequence is represented as a string of elements (hereafter called sequence elements), each designated by a single letter corresponding to a particular amino acid or nucleotide.

In some cases, however, it is necessary to align sequences whose elements are designated by several characters rather than one. In threading, for instance, one sequence, whose elements are amino acid types designated by single alphabetic characters, is threaded onto another sequence whose elements describe the structural environment of the corresponding amino acid and are designated, as a rule, by a number or a set of numbers that may consist of several characters. Another example is the use of position-specific score matrices (PSSM) (Altschul et al., 1997) for amino acid sequence alignment. A PSSM is an L × 20 matrix (where L is the length of the amino acid sequence) that is generated by multiple alignment of a query sequence against a protein database and estimates the probability of finding each type of amino acid at each position of the query protein sequence. Such matrices are generated and used in PSI-BLAST, PSI-PRED and many threading programs.
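To illustrate how residue positions can serve as multi-character sequence elements, the following minimal Python sketch turns an L × 20 PSSM into a lookup table keyed by (position label, amino acid). The function name `pssm_to_scores`, the column order in `AMINO_ACIDS` and the toy values are assumptions made for illustration only; they do not describe the input format of ALIGN_MTX.

```python
# Assumed column order of the 20 amino acids (hypothetical for this sketch).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def pssm_to_scores(pssm_rows):
    """Convert an L x 20 PSSM into a dict keyed by (position label, amino acid).

    Position labels are the 1-based residue numbers written as strings, so
    positions from 10 onward are sequence elements of several characters.
    """
    scores = {}
    for position, row in enumerate(pssm_rows, start=1):
        if len(row) != len(AMINO_ACIDS):
            raise ValueError("each PSSM row must contain 20 values")
        for amino_acid, value in zip(AMINO_ACIDS, row):
            scores[(str(position), amino_acid)] = value
    return scores


# Toy PSSM with 12 positions; the numbers carry no biological meaning.
toy_pssm = [[p * 10 + k for k in range(20)] for p in range(1, 13)]
toy_scores = pssm_to_scores(toy_pssm)
query_elements = [str(p) for p in range(1, 13)]  # "1", "2", ..., "12"
```

Here the query is represented by its position labels, and the second sequence remains an ordinary string of one-letter amino acid codes.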
When a PSSM is used, the sequence element is the residue number (position) in the query protein, which obviously may consist of several characters. Thus, although multi-character designations of sequence elements are used in various programs (threading, PSI-BLAST, etc.), the corresponding substitution matrices are, as a rule, generated inside those programs. At the same time, there are practically no separately available programs that perform alignment with user-supplied substitution matrices whose elements may be designated by several characters. This paper presents an optimal sequence alignment program, ALIGN_MTX, which allows up to 10 symbols to describe each sequence element and accepts arbitrary user substitution matrices. For threading, the program can also take into account secondary structure predictions for the query amino acid sequence.

ALIGN_MTX is written in FORTRAN and uses the dynamic programming algorithm described by Needleman and Wunsch (1970) together with the extensions to that algorithm developed by Smith and Waterman (1981) and Gotoh (1982). Dynamic programming is a mathematical technique for finding the minimum or maximum of a discrete function. It has found use not only in biological sequence comparison but also in analyzing the relatedness of bird songs, matching geological features distorted across faults, gas chromatography, speech recognition and general text analysis. In general text analysis, dynamic programming can be used to compute the minimum number of characters that must be changed to convert one word (or one sentence) into another. The implementation of the dynamic programming algorithm in ALIGN_MTX is flexible enough to be adapted easily to these and similar uses.
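The idea of aligning sequences of multi-character elements with a user-supplied matrix can be sketched in a few lines of Python. This is a simplified illustration, not the ALIGN_MTX implementation: it performs global (Needleman-Wunsch) alignment with a single linear gap penalty instead of the separate opening and extension penalties of Gotoh (1982), and the function name `align_tokens`, the environment labels `bur` and `exp` and the score values are all hypothetical.

```python
def align_tokens(query, template, score, gap=1.0):
    """Globally align two lists of string tokens.

    score maps (query_token, template_token) to a similarity value;
    gap is a linear penalty charged for every gapped position.
    Returns the optimal score and the aligned pairs ("-" marks a gap).
    """
    n, m = len(query), len(template)
    # best[i][j] is the best score for query[:i] aligned with template[:j].
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = best[i - 1][0] - gap
    for j in range(1, m + 1):
        best[0][j] = best[0][j - 1] - gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = best[i - 1][j - 1] + score[(query[i - 1], template[j - 1])]
            best[i][j] = max(match, best[i - 1][j] - gap, best[i][j - 1] - gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and best[i][j]
                == best[i - 1][j - 1] + score[(query[i - 1], template[j - 1])]):
            pairs.append((query[i - 1], template[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or best[i][j] == best[i - 1][j] - gap):
            pairs.append((query[i - 1], "-"))
            i -= 1
        else:
            pairs.append(("-", template[j - 1]))
            j -= 1
    pairs.reverse()
    return best[n][m], pairs


# Hypothetical threading-style example: one-letter amino acids are aligned
# against multi-character structural environment labels.
toy_score = {
    ("L", "bur"): 2, ("L", "exp"): -1,
    ("K", "bur"): -1, ("K", "exp"): 2,
}
total, aligned = align_tokens(["L", "K", "L"], ["bur", "exp", "exp", "bur"], toy_score)
```

Because the sequences are lists of tokens rather than character strings, the length of an element label has no effect on the recursion; only the lookup key changes.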
The program also supports global and local alignment, as well as variants of global-local alignment in which gaps at the beginnings and ends of the aligned sequences may or may not be penalized. A detailed description of the dynamic programming algorithm used is included in the software package and can also be found at http://www.imbbp.org/software/ALIGN_MTX/algorithm.html.

2.2. Using ALIGN_MTX

The software package is available at www.imbbp.org/software/ALIGN_MTX/ as a zip archive and runs under Windows 98/2000/XP. Alignment parameters are set in a dialog window, and several alignments can be computed in a single run of the program. If a parameter is set incorrectly, the program reports the error and returns to the dialog window with the last settings, so that they can be corrected easily. To take secondary structure prediction into account in threading, the user must also supply the secondary structure prediction file for the query protein, in the format produced by the program PREDATOR (Frishman and Argos, 1996), and the coordinates of the three-dimensional structure onto which the query sequence is threaded, in the format produced by the program DSSP (Kabsch and Sander, 1983). A detailed description of how to use the program and of the corresponding file formats is available at the website.

2.3. Comparison of Alignment Quality for Different Substitution Matrices

As mentioned above, alignment quality depends on the type of query sequence and template structure pair. It has been shown (Hobohm and Sander, 1995; Holm et al., 1992) that for evolutionarily close pairs with high sequence similarity (>20–30%) ECBSM matrices give good results for alignment quality and fold recognition, but for pairs with low sequence similarity alignment quality is in most cases poor and other approaches are required to improve alignment accuracy. This paper presents a comparison of alignment quality for distantly related protein pairs. The following substitution matrices were used: the threading matrix SORDIS, based on side-chain orientation (Vishnepolsky et al., 2008); BLOSUM45, one of the most popular ECBSM matrices for distantly related proteins; the position-specific score matrix PSSM; and a random matrix.

Alignment quality was tested with the ProSup benchmark, which was prepared by Sippl's group to test alignment accuracy (Domingues et al., 2000). The set consists of 127 pairs of proteins with correct alignments obtained by the structural alignment program ProSup. To evaluate alignment accuracy for distantly related proteins only, a set of 58 pairs with sequence identity <16% was selected (see Table 1). The accuracy of an alignment was obtained by calculating the percentage of matches between the correct alignment and the alignment produced by ALIGN_MTX with each substitution matrix. An alignment is counted as correct here when more than 50% of its positions match the reference alignment prepared by ProSup to within ±2 residue shifts.

The alignments were computed using different variants of the dynamic programming method (http://www.imbbp.org/software/ALIGN_MTX/algorithm.html). Dynamic programming algorithms require gap penalty values to be specified. The optimal gap opening penalty, gap extension penalty and variant of the dynamic programming method for the BLOSUM, SORDIS and random matrices were obtained by the method of Vishnepolsky et al. (2008), based on their performance in fold recognition experiments. The PSSM was generated using PSI-BLAST (Altschul et al., 1997) on the NCBI server (http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi?PAGE=Proteins) with the default values for gap penalties. The same values were used for alignment by ALIGN_MTX. The gap penalties used for the different substitution matrices are presented in Table 2.

Table 2. Optimal set of gap penalties and variant of dynamic programming method for different substitution matrices.

Matrix     Gap opening   Gap extension   Variant of dynamic
           penalty       penalty         programming method (a)
SORDIS     2.0           0.6             Ns+ Np+ Cs− Cp+
PSSM       11.0          1.0             Global
BLOSUM45   13.6          2.4             Global
RANDOM     2.2           2.0             Ns− Np+ Cs+ Cp

(a) (+) penalized; (−) not penalized; (N) N-terminus; (C) C-terminus; (s) query sequence; (p) template structure. For example, Ns+ Np+ Cs+ Cp− means that the N-terminus of the query sequence, the N-terminus of the template structure and the C-terminus of the query sequence are penalized, and the C-terminus of the template structure is not penalized. Global alignment means that all ends are penalized.

3. Results

The results of comparing the alignment accuracy of SORDIS with that of the BLOSUM45, PSSM and random matrices are presented in Fig. 1. The figure shows the fraction of correct alignments in the full data set and in the reduced set of 30 pairs with less than 12% identity. BLOSUM shows the worst performance in both sets, while PSSM shows the best performance in the full set and the same performance as SORDIS in the reduced set. It can also be noted that in the reduced set only 33% of the correctly aligned pairs are common to the PSSM and SORDIS matrices; for the other 67%, a correct alignment was obtained with either PSSM or SORDIS but not both. For the full set, the percentage of common correct alignments is higher (60% for SORDIS and 53% for PSSM). Therefore a combined potential using SORDIS, which takes into account structural information about the template protein, and PSSM, which uses multiple sequence alignment information, should give the best alignment accuracy, especially for pairs with very low sequence similarity.

Fig. 1. The fraction of correct alignments computed with different matrices in (a) the full set (with <16% sequence similarity) and (b) the reduced set (with <12% sequence similarity).

4. Conclusion

The program ALIGN_MTX aligns two textual sequences whose elements may each be designated by several characters, using arbitrary user-supplied substitution matrices. It can be used not only to align amino acid and nucleotide sequences but also for the sequence-structure alignment used in threading, for amino acid sequence alignment with a PSSM, and in other cases where biological or non-biological textual sequences must be aligned. This distinguishes it from most similar alignment programs, which as a rule align only amino acid or nucleotide sequences represented as strings of single alphabetic characters. We think that the program can be useful for scientists who develop threading potentials and for those who need to align textual sequences. As an application of the program, a comparison of the alignment quality obtained with different substitution matrices on sets of distantly related protein pairs was presented. PSSM, which uses multiple sequence alignment information, performs best, but on the reduced set with lower sequence similarity the threading matrix SORDIS performs equally well, and the results indicate that a potential combining SORDIS and PSSM can improve alignment accuracy for evolutionarily distantly related protein pairs.
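The correctness criterion of Section 2.3 (more than 50% of positions matching the ProSup reference to within ±2 residue shifts) can be sketched as follows. This is one possible reading of the criterion: it assumes that an alignment is stored as a mapping from query residue index to template residue index and that the fraction is taken over the aligned positions of the reference. The function names are hypothetical.

```python
def fraction_matching(predicted, reference, max_shift=2):
    """Fraction of reference positions reproduced within max_shift residues.

    Both arguments map a query residue index to the template residue index
    it is aligned with; unaligned query residues are simply absent.
    """
    if not reference:
        raise ValueError("reference alignment is empty")
    hits = 0
    for query_index, ref_template_index in reference.items():
        pred_template_index = predicted.get(query_index)
        if (pred_template_index is not None
                and abs(pred_template_index - ref_template_index) <= max_shift):
            hits += 1
    return hits / len(reference)


def is_correct(predicted, reference, max_shift=2, threshold=0.5):
    """An alignment counts as correct when strictly more than threshold matches."""
    return fraction_matching(predicted, reference, max_shift) > threshold


# Toy data, not taken from the benchmark.
reference = {0: 0, 1: 1, 2: 2, 3: 3}
half_right = {0: 2, 1: 3, 2: 7, 3: 9}   # two of four positions within +/-2
mostly_right = {0: 1, 1: 2, 2: 3}       # three of four positions within +/-2
```

With this reading, `half_right` reaches exactly 50% and is therefore not counted as correct, while `mostly_right` reaches 75% and is.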
Acknowledgement

This work was carried out with the financial support of the Georgia National Science Foundation (Grant GNSF/ST07/6-239).

References

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389–3402.

Bowie, J.U., Luethy, R., Eisenberg, D., 1991. A method to identify protein sequences that fold into a known three-dimensional structure. Science 253, 164–169.

Bryant, S.H., Lawrence, C.E., 1993. An empirical energy function for threading protein sequence through the folding motif. Proteins 16, 92–112.

Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C., 1978. A model of evolutionary change in proteins. In: Dayhoff, M.O. (Ed.), Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Washington, DC, pp. 345–352.

Domingues, F.S., Lackner, P., Andreeva, A., Sippl, M.J., 2000. Structure-based evaluation of sequence comparison and fold recognition alignment accuracy. J. Mol. Biol. 297, 1003–1013.

Frishman, D., Argos, P., 1996. Incorporation of long-distance interactions into a secondary structure prediction algorithm. Protein Eng. 9, 133–142.

Godzik, A., Kolinski, A., Skolnick, J., 1992. Topology fingerprint approach to the inverse protein folding problem. J. Mol. Biol. 227, 227–238.

Gonnet, G.H., Cohen, M.A., Benner, S.A., 1992. Exhaustive matching of the entire protein sequence database. Science 256, 1433–1445.

Gotoh, O., 1982. An improved algorithm for matching biological sequences. J. Mol. Biol. 162, 705–708.

Gribskov, M., McLachlan, A.D., Eisenberg, D., 1987. Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. U.S.A. 84, 4355–4358.

Henikoff, S., Henikoff, J.G., 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. U.S.A. 89, 10915–10919.

Henikoff, S., Henikoff, J.G., 2000. Amino acid substitution matrices. Adv. Protein Chem. 54, 73–96.

Hobohm, U., Sander, C., 1995. A sequence property approach to searching protein databases. J. Mol. Biol. 251, 390–399.

Holm, L., Ouzounis, C., Sander, C., Tuparev, G., Vriend, G., 1992. A database of protein structure families with common folding motifs. Protein Sci. 1, 1691–1698.

Jones, D.T., Taylor, W.R., Thornton, J.M., 1992. A new approach to protein fold recognition. Nature 358, 86–89.

Kabsch, W., Sander, C., 1983. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577–2637.

Kocher, A.J.P., Rooman, M.J., Wodak, S.J., 1994. Factors influencing the ability of knowledge-based potentials to identify native sequence-structure matches. J. Mol. Biol. 235, 1598–1613.

Needleman, S.B., Wunsch, C.D., 1970. A general method applicable to the search for similarities in the amino acid sequences of two proteins. J. Mol. Biol. 48, 443–453.

Ouzounis, C., Sander, C., Scharf, M., Schneider, R., 1993. Prediction of protein structure by evaluation of sequence-structure fitness. Aligning sequences to contact profiles derived from three-dimensional structures. J. Mol. Biol. 232, 805–825.

Rost, B., 1995. TOPITS: threading one-dimensional predictions into three-dimensional structures. ISMB 3, 314–321.

Sippl, M.J., Weitckus, S., 1992. Detection of native-like models for amino acid sequences of unknown three-dimensional structure in a data base of known protein conformations. Proteins 13, 258–271.

Sjolander, K., Karplus, K., Brown, M., Hughey, R., Krogh, A., Mian, I., Haussler, D., 1996. Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. Comput. Appl. Biosci. 12, 327–345.

Skolnick, J., Kihara, D., 2001. Defrosting the frozen approximation: PROSPECTOR—a new approach to threading. Proteins 42, 319–331.

Smith, T.F., Waterman, M.S., 1981. The identification of common molecular subsequences. J. Mol. Biol. 147, 195–197.

Tan, Y.H., Huang, H., Kihara, D., 2006. Statistical potential-based amino acid similarity matrices for aligning distantly related protein sequences. Proteins 64, 587–600.

Vishnepolsky, B., Managadze, G., Pirtskhalava, M., 2008. Comparison of the efficiency of evolutionary change-based and side chain orientation-based fold recognition potentials. Proteins 71, 1863–1878.