A feature vector integration approach for a generalized support vector machine pairwise homology algorithm

A feature vector integration approach for a generalized support vector machine pairwise homology algorithm

Computational Biology and Chemistry 32 (2008) 458–461 Contents lists available at ScienceDirect Computational Biology and Chemistry journal homepage...

352KB Sizes 0 Downloads 70 Views

Computational Biology and Chemistry 32 (2008) 458–461

Contents lists available at ScienceDirect

Computational Biology and Chemistry journal homepage: www.elsevier.com/locate/compbiolchem

Brief Communication

A feature vector integration approach for a generalized support vector machine pairwise homology algorithm Bobbie-Jo M. Webb-Robertson a,∗ , Christopher S. Oehmen a , Anuj R. Shah b a b

Computational Biology & Bioinformatics, Pacific Northwest National Laboratory, United States Scientific Data Management, Pacific Northwest National Laboratory, United States

a r t i c l e

i n f o

Article history: Received 1 April 2008 Received in revised form 23 June 2008 Accepted 2 July 2008 Keywords: Homology Support vector machine Integration

a b s t r a c t Due to the exponential growth of sequenced genomes, the need to quickly provide accurate annotation for existing and new sequences is paramount to facilitate biological research. Current sequence comparison approaches fail to detect homologous relationships when sequence similarity is low. Support vector machine (SVM) algorithms approach this problem by transforming all proteins into a feature space of equal dimension based on protein properties, such as sequence similarity scores against a basis set of proteins or motifs. This multivariate representation of the protein space is then used to build a classifier specific to a pre-defined protein family. However, this approach is not well suited to large-scale annotation. We have developed a SVM approach that formulates remote homology as a single classifier that answers the pairwise comparison problem by integrating the two feature vectors for a pair of sequences into a single vector representation that can be used to build a classifier that separates sequence pairs into homologs and non-homologs. This pairwise SVM approach significantly improves the task of remote homology detection on the benchmark dataset, quantified as the area under the receiver operating characteristic curve; 0.97 versus 0.73 and 0.70 for PSI-BLAST and Basic Local Alignment Search Tool (BLAST), respectively. © 2008 Elsevier Ltd. All rights reserved.

1. Introduction Numerous algorithms have been developed to infer homologous relationships between protein sequences, often termed homology detection. Among these, the Basic Local Alignment Search Tool (BLAST) and its derivatives, such as PSI-BLAST, are arguably the most visible in the community (Altschul et al., 1990, 1997). BLAST however comes with a trade-off of sensitivity, thus recent years have seen an explosion in approaches to apply statistical discriminative algorithms, primarily support vector machines (SVMs), to the problem of remote homology detection (Hou et al., 2003; Jaakkola et al., 1999; Leslie et al., 2004; Liao and Noble, 2003; Webb-Robertson et al., 2005). These new algorithms have led to a dramatic improvement in sensitivity, but have been limited in their use to a small set of select protein families with sufficient membership to train the classifier. The most common approach to apply SVMs to remote homology detection is to generate a set of n classifiers, each trained to

∗ Corresponding author at: Computational Biology & Bioinformatics, 902 Battelle Boulevard, P.O. Box 999, Richland, WA 99352, United States. Tel.: +1 509 375 2292; fax: +1 509 372 4720. E-mail address: [email protected] (B.-J.M. Webb-Robertson). 1476-9271/$ – see front matter © 2008 Elsevier Ltd. All rights reserved. doi:10.1016/j.compbiolchem.2008.07.017

distinguish a specific protein family. These methods are in essence answering a different question than traditional sequence–sequence or sequence–profile methods. “Does this protein belong to family X? “versus” Are these two sequences homologous?”. The traditional question of homology is the more relevant for many applications in bioinformatics, such as large-scale genome annotation or discovery of new relationships for small or yet to be defined families.

2. Methods We have developed the SVM HOmology Tool (SHOT) that combines the sensitivity of SVMs and the flexibility of pairwise (rather than family-based) homology detection. SHOT overcomes the limitations of family-based methods by providing a single classifier to determine if a pair of sequences is homologous. SHOT begins by transforming the query into a vector representation, similar to the family-based approaches. The second and third steps are unique to the SHOT methodology. The second step consists of comparing the vector query sequence (or database) to generate a paired vector representation. The final step runs this paired vector sequence through the SHOT classifier to determine if these two sequences are homologous.

B.-J.M. Webb-Robertson et al. / Computational Biology and Chemistry 32 (2008) 458–461

2.1. Feature Integration: SHOT Pairwise Vector Representation The primary differentiating factor between SVM remote homology detection methods is how the variable length protein sequences are transformed into a multivariate representation, or the subsequent kernel representation. In general terms, a query sequence Q is represented as a set of m scores against a set of sequences or motifs (R1 , . . ., Rm ), FQ = [s(Q, R1 ), s(Q, R2 ), · · ·, s(Q, Rm )],

(1)

where s(Q,Rk ) could be, for example the Simth-Waterman E-value (Smith and Waterman, 1981) between Q and Rk . We demonstrate the SHOT methodology on vectors previously developed for SVM family-based homology detection, SVM-Pairwise (Liao and Noble, 2003). SVM-Pairwise is an implementation that describes the vector of a query protein as a collection of E-values resulting from the Smith-Waterman algorithm run against the Structural Classification of Proteins (SCOP) 1.53 database (4352 sequences), creating a vector in Eq. (1) where s(Q,Rk ) is the Smith-Waterman E-value for m = 4352 sequences. These scores are publicly available at http://noble.gs.washington.edu/papers/. To develop a pairwise homology SVM algorithm, a pair of sequences need to be defined as a single vector. SHOT performs this task by defining the score for a pair of sequences, Q1 and Q2 , as the element-wise multiplication of the SVM-Pairwise vectors,

459

feature space that can be linearly separated. This is performed by solving a quadratic programming problem on the data mapped to a kernel function. The final classification is simply a linear computation, f (x) =



˛i K(x, xi ) + b,

(3)

i

 b) defines the sepwhere K defines the kernel transform and (˛, arating hyperplane. For the development of SHOT we used a normalized quadratic kernel with a constant of 10 and coefficient of 1. Solving the quadratic programming problem for training a SVM is not trivial for large dense kernel matrices since training on one of the datasets described in Section 2.2 requires kernel matrix of size 37627 × 37627. Training was done using GIST (http://bioinformatics.ubc.ca/gist) on an SGI Altix with 256 Gb of shared memory. 3. Results and Discussion

This vector yields an overall similarity metric to the ‘basis’ set of 4352 sequences. For pairs of sequences that are not homologous the multiplicative values will generally be larger than those for nonhomologs. The SVM for SHOT is trained to recognize sequence pairs which are homologous by exploiting their similarity to a common subset of the basis sequences.

The most widely used tool by the biological community in the world of remote homology detection is BLAST. BLAST has become a phenomenon in the biological community because it is both fast and easy to use. Changes to include more accurate familybased iterative methods, such as PSI-BLAST, and its applications on high-performance computing architectures have kept it in the mainstream. However, BLAST does come at a cost: it identifies fewer remote homologs at the same false discovery rate than slower optimization-based approaches. PSI-BLAST addresses some accuracy issues but the iterative nature of the algorithm increases computation time. This speed–accuracy trade-off is the common problem in the world of remote homology detection algorithms. In the evaluation of this new algorithm we address these two issues.

2.2. Training and Testing Data Files

3.1. Accuracy and Reproducibility

The SCOP 1.53 database has been used extensively for testing remote homology algorithms. This SCOP database contains 4352 sequences carefully selected such that sequences of high redundancy at the residue level are removed. Given the known structural class of each sequence it can be grouped based on domain, fold, superfamily and family. Typically proteins that belong to the same superfamily in SCOP are considered homologous (Cheek et al., 2004; Murzin et al., 1995). In a pairwise analysis, for SCOP 1.53, there are 9 467 776 unique pairs. Training a SVM at this scale on a dense data matrix is virtually impossible using any current SVM implementations. Of the over nine million sequence pairs in SCOP 1.53, approximately 1% (94 065) are defined as true homologous pairs at the superfamily level. To represent the space of true positives ∼20% of the homologous pairs are randomly selected for the training set (18 813 pairs). An equal number of negative examples are selected from the more than nine million defined non-homologous pairs. The test set contained the remaining 75 252 positives and again an equal number of randomly selected negatives. This procedure was repeated nine times to test the robustness of the final classifier over different selections of positive and negatives examples. We use a combination of highperformance computational hardware and software capabilities to train on this statistically sampled subset of the SCOP 1.53 pairwise representation of the dataset.

Receiver operating characteristic (ROC) curves are a standard approach to comparing the accuracy of different classifiers (Gribskov and Robinson, 1996). The ROC curve gives an overall view of the false discovery rate at any sensitivity threshold. The area under this curve (AUC) is a good overall metric of algorithm performance since an AUC of 1 means the classifier finds all true relationships without tolerating any false positives. We use this basic ROC strategy to compare SHOT to BLAST and PSI-BLAST. We do not compare SHOT to other SVM algorithms because they are based on family membership whereas SHOT is focused on pairwise relationships where the SVM score is analogous to a BLAST or PSI-BLAST E-value. As described in Section 2.2 since only a small fraction of the negative examples are included in the training phase 9 files were randomly generated to evaluate the overall accuracy statistics with different samples of non-homologous pairs. Since each file generation was random only a subset of the positive examples will be the same across training sets and it is unlikely that more than a few negative examples would be identical given the sample is of size 18 813 from a space of over 9 million. The testing files met the same criteria with the additional constraint that a testing pair could not be present in the corresponding training file to assure testing on an independent population from the training pairs. Since training is not required for BLAST algorithms the BLAST Evalue scores were attained for all pairs in each of the test files using ScalaBLAST (Oehmen and Nieplocha, 2006). PSI-BLAST scores were attained using the NCBI publicly available software—PSI-BLAST was allowed to build the position specific scoring matrix (PSSM) for a query by searching against the NR database for up to 20 iterations.

FQ 1 Q w = [s(Q 2 , R1 ) s(Q 2 , R1 ), · · ·, s(Q 1 , Rm ) s(Q 2 , Rm )].

(2)

2.3. SVM Training SVMs are a state-of-the-art machine learning tool for building classification models by mapping the data into a high-dimensional

460

B.-J.M. Webb-Robertson et al. / Computational Biology and Chemistry 32 (2008) 458–461

Fig. 1. The ROC curves of SHOT, BLAST and PSI-BLAST representing the false positive rate vs. true positive rate for the first training file. The AUC for SHOT is 0.966, which is significantly more accurate than BLAST, AUC of 0.700, and PSI-BLAST, AUC of 0.732, at p-values of <0.0001 in both cases using a Signed-Rank non-parametric test.

The query defined PSSM was then used to score that sequence against the remaining sequences in SCOP 153. The E-values from BLAST and PSI-BLAST are compared against SHOT in Fig. 1. For both BLAST and PSI-BLAST typically a small E-value threshold is used. This however only returns a small number of identifications, so to evaluate full false positive range the E-value threshold for returning a score for a pair of sequences was set to 1e+50. All pairs with an E-value of greater than 1e+50 were set to this arbitrary constant. This constant resulted in an uncharacteristic jump in the ROC curves when this threshold is reached, but this area associated with a large false positive rate (FPR) is typically not of interest in remote homology detection. Overall, the increased sensitivity of SHOT over the BLAST algorithms at any specificity level is apparent and can be quantified with an AUC of ∼0.966 versus ∼0.732 for PSI-BLAST and ∼0.700 for BLAST. Fig. 1 is an example of the ROC plots attained for one training file, however resulting ROC curves over all nine runs are essentially indistinguishable by eye. Table 1 gives the statistics associated with the nine randomly generated training files discussed in Section 2.2 with respect to mean, median and standard deviation values. Overall, the three algorithms have very good reproducibility over the nine randomly selected test sets. Typically, sequence analysis software seeks a significance threshold associated with a desired FPR. This FPR assures that only a small percentage of identified homologs will be incorrect relationships. The number of true relationships returned is dependent upon the homology identification algorithm. These values can be identified by setting a FPR threshold and determining the associated true positive rate (TPR) on the y-axis. Fig. 2 illustrates the TPR at a 1% and 5% FPR threshold. There was a TPR improvement of SHOT to ∼78.5% from ∼51.2% for PSI-BLAST. This means of the 150 504 sequence pairs evaluated (75 252 homologous and Table 1 Statistics of reproducibility using different randomly selected training sets from SCOP 1.53 Algorithm

Mean

Median

Standard deviation

BLAST PSI-BLAST SHOT

0.700 0.734 0.965

0.700 0.734 0.965

6.76e−4 6.98e−4 11.00e−4

Fig. 2. The TPR for BLAST, PSI-BLAST and SHOT are 48.8, 51.2 and 78.5, respectively at a 1% FPR. These values increase to 53.1, 55.0 and 88.8, respectively, for a less conservative FPR of 5%.

75 252 non-homologous) at a 1% FPR there have been 752 false positive identifications (non-homologous pairs wrongly classified) and 38 529 and 59 073 correctly classified for PSI-BLAST and SHOT, respectively. This is a gain of over 20 000 more of the possible homologous relationships identified by SHOT than PSI-BLAST at the same 1% FPR. For a 5% FPR SHOT identified over 25 000 more homologous pairs than PSI-BLAST and BLAST. 3.2. Speed–Computational Costs Performing remote homology fast can be as important as doing it accurately when performing large-scale searches. Remote homology algorithms are typically used in a database search fashion to identify as many evolutionarily related sequences to the query as possible and, in some cases, to perform complete species–species comparisons. SHOT would be most powerful for these large database searches as it does not provide physical alignments, but allows much more accurate identification of relationships. Classification with SHOT requires three steps: (1) vectorization of the query sequence; (2) computation of Eq. (2) for each vectorized sequence in the database; (3) calculation of the SVM score for classification. The first step of query vectorization adds negligible computation time; it is simply the computation of the Smith-Waterman score for each of the descriptor sequences (in this case 4352), which requires only a few seconds to compute (Rognes and Seeberg, 2000; Saebo et al., 2005). In using SHOT, the database that it would be compared against is not a sequence database but a database of vectorized sequences pre-computed in the same manner as the query. This cost is incurred by the developers and is thus invisible to the user. In the second step, the vectorized query has an element-wise comparison to each of the sequence vectors in the database. This generates Eq. (2) for each database sequence, which makes it possible to determine if that sequence is a homolog to the query. This again is a negligible computational cost and can be reduced further using a reduced-dimensionality representation of the data. Using an alternate vectorization strategy, such as I-sites sequence–structure motifs (Bystroff and Baker, 1998; Hou et al., 2003), the data could be represented in 20–50 elements without degrading the performance of the classifier. Finally, the actual classification task is the rate-determining step. It can be performed by organizing the collection of converted sequences into a large matrix and multiplying it into a matrix containing all the support vectors.

B.-J.M. Webb-Robertson et al. / Computational Biology and Chemistry 32 (2008) 458–461

461

identify a true basis set of proteins for vectorization that take into account the diversity of protein structures. With this in mind vectorization strategies that use structural properties, such as I-Sites (Bystroff and Baker) are under investigation. The second topic of primary interest is the sampling of the negative space. This article uses random sampling over multiple training and testing files, but the ability to define a set of negatives that represent a consensus of the entire space would be of considerable value. Acknowledgements

Fig. 3. Run-time for calculating degree of homology between a single sequence and a database of one million sequences is comparable for SHOT and PSI-BLAST for a data dimension of 50.

To demonstrate that these operations can be performed in a time similar to the run-time of PSI-BLAST for real-world applications, we ran PSI-BLAST for varying numbers of iterations for a small collection of proteins. We also performed the collection of vectorization, matrix–matrix, and matrix–vector products on the same architecture for varying problem sizes to estimate the total run-time for a SHOT implementation. At the scale of a data representation of 50 variables and approximately 4000 support vectors, the expected run-time for classification of one query against a database of 1 million proteins is comparable to a PSI-BLAST run using 5 iterations (scaled to a database size of 1 million proteins). Even SHOT classifiers with significantly more support vectors, ∼13 000, still have a run-time significantly less than the average PSI-BLAST run at the full 20 iterations, Fig. 3. For the results shown in Fig. 1, approximately 11 000 support vectors were selected by the SVM training. 3.3. Conclusions and Future Work SHOT demonstrates an approach to answer the pairwise homology problem using a SVM that has the potential to be competitive on computation time with heuristic-based algorithms. The novelty of the method comes in the pairwise vectorization strategy, Eq. (2), integrated with the SVM learning algorithm. In implementing this overall framework, the largest challenge was associated with the SVM training on the large dense matrix of pairwise vectors. However, recent algorithms, such as PSVM (Chang et al., 2007) have minimized this cost. Additionally, the nine cross-validation analyses demonstrated that the method is robust to the sequences selected for training on the SCOP 1.53 benchmark dataset. The caveat is that this basis set of proteins only represents a small fraction of the overall global protein space and as such scaling is a major challenge with the current feature vector representation. This work is the first step in developing a robust pairwise SVM application for remote homology detection. Several directions are being investigated currently. The main area of interest is in identification of a minimal set of descriptors that do not rely on a sequence similarity score such as an E-value. In general, it will be difficult to

We acknowledge the useful feedback of William Cannon, Karin Rodland, Lee Ann McCue and Jason McDermott at Pacific Northwest National Laboratory (PNNL). This work was supported by the U.S. Department of Energy (DOE) through the Data-Intensive Computing Initiative with the Laboratory Directed Research and Development program at PNNL and the National Science Foundation under contract 53836A. Computational support was provided by the Molecular Sciences Computing Facility at PNNL under proposal 29393. PNNL is a multi-program national laboratory operated by Battelle for the U.S. DOE under contract DE-AC05-76RL01830. References Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J., 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403–410. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402. Bystroff, C., Baker, D., 1998. Prediction of local structure in proteins using a library of sequence–structure motifs. J. Mol. Biol. 281, 565–577. Chang, E.Y., Zhu, K., Wang, H., Bai, H., Li, J., Qiu, Z., Cui, H., 2007. Parallelizing support vector machines on distributed computers. NIPS. Cheek, S., Qi, Y., Krishna, S.S., Kinch, L.N., Grishin, N.V., 2004. 4SCOPmap: automated assignment of protein structures to evolutionary superfamilies. BMC Bioinformatics 5, 197. Gribskov, M., Robinson, N.L., 1996. Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput. Chem. 20, 25–33. Hou, Y., Hsu, W., Lee, M.L., Bystroff, C., 2003. Efficient remote homology detection using local structure. Bioinformatics 19, 2294–2301. Jaakkola, T., Diekhans, M., Haussler, D., 1999. Using the Fisher kernel method to detect remote protein homologies. Proc. Int. Conf. Intell. Syst. Mol. Biol., 149–158. Leslie, C.S., Eskin, E., Cohen, A., Weston, J., Noble, W.S., 2004. Mismatch string kernels for discriminative protein classification. Bioinformatics 20, 467–476. Liao, L., Noble, W.S., 2003. Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships. J. Comput. Biol. 10, 857–868. Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia, C., 1995. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540. Oehmen, C., Nieplocha, J., 2006. ScalaBLAST: a scalable implementation of BLAST for high-performance data-intensive bioinformatics analysis. IEEE Trans. Parallel Distr. Syst. 17, 740–748. Rognes, T., Seeberg, E., 2000. Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors. Bioinformatics 16, 699–706. Saebo, P.E., Andersen, S.M., Myrseth, J., Laerdahl, J.K., Rognes, T., 2005. PARALIGN: rapid and sensitive sequence similarity searches powered by parallel computing technology. Nucleic Acids Res. 33, W535–W539. Smith, T.F., Waterman, M.S., 1981. Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197. Webb-Robertson, B.J., Oehmen, C., Matzke, M., 2005. SVM-BALSA: remote homology detection based on Bayesian sequence alignment. Comput. Biol. Chem. 29, 440–443.