ProteinLasso: A Lasso regression approach to protein inference problem in shotgun proteomics

Computational Biology and Chemistry 43 (2013) 46–54

Contents lists available at SciVerse ScienceDirect

Computational Biology and Chemistry journal homepage: www.elsevier.com/locate/compbiolchem

Ting Huang a, Haipeng Gong a, Can Yang b, Zengyou He a,∗

a School of Software, Dalian University of Technology, China
b Department of Epidemiology and Public Health, Yale University School of Medicine, New Haven, CT 06520, USA

ARTICLE INFO

Article history:
Received 26 August 2012
Received in revised form 30 December 2012
Accepted 30 December 2012

Keywords: Proteomics; Protein inference; Lasso; Coordinate descent; Ensemble learning

ABSTRACT

Protein inference is an important issue in proteomics research. Its main objective is to select a proper subset of candidate proteins that best explains the observed peptides. Although many methods have been proposed for solving this problem, several issues such as peptide degeneracy and one-hit wonders remain unsolved. Therefore, the accurate identification of proteins that are truly present in the sample continues to be a challenging task. Based on the concept of peptide detectability, we formulate the protein inference problem as a constrained Lasso regression problem, which can be solved very efficiently through a coordinate descent procedure. The new inference algorithm, named ProteinLasso, explores an ensemble learning strategy to address the sparsity parameter selection problem in the Lasso model. We test the performance of ProteinLasso on three datasets. As shown in the experimental results, ProteinLasso outperforms state-of-the-art protein inference algorithms in terms of both identification accuracy and running efficiency. In addition, we show that ProteinLasso is stable under different parameter specifications. The source code of our algorithm is available at: http://sourceforge.net/projects/proteinlasso.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

Proteomics is a relatively new but rapidly developing research area whose main objective is the large-scale study of the proteins expressed in an organism (Blackstock and Weir, 1999). Mass spectrometry (MS) is a widely used tool for proteomics research due to its sensitivity and the versatility of the instrumentation (Vitek, 2009). A primary goal of MS-based proteomics is to identify all the proteins expressed in a cell or tissue. Within MS-based proteomics, shotgun proteomics has become popular due to its high throughput. Shotgun proteomics identifies complex protein mixtures using a combination of high performance liquid chromatography (LC) and mass spectrometry (MS). A general view of protein identification in shotgun proteomics is illustrated in Fig. 1. Proteins in the sample are first digested into mixtures of peptides with an enzyme such as trypsin, and the peptides are then separated and eluted by liquid chromatography. This digestion procedure is necessary because the sensitivity of mass spectrometry is much higher for peptides than for proteins (Vitek, 2009). The eluted peptides are subsequently ionized, fragmented and scanned by tandem mass spectrometry (MS/MS) to produce a set of MS/MS spectra. Finally, peptides and proteins are identified by subsequent computational analysis of the acquired MS/MS spectra (Fenyo, 2000; Chakravarti et al., 2002).

∗ Corresponding author. Tel.: +86 411 87571630. E-mail address: [email protected] (Z. He).
1476-9271 © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.compbiolchem.2012.12.008

For the computational analysis, two major steps are required: first, each MS/MS spectrum is analyzed to determine the underlying amino acid sequence of a peptide; second, the identified peptides are grouped together to identify the proteins. For the peptide identification step, typical approaches involve database search (Eng et al., 1994) and de novo sequencing (Dancik et al., 1999). The database search approach, which compares each observed spectrum against entries in a database to find the best Peptide-Spectrum Match (PSM), is the most popular peptide identification method (Fig. 1).

After peptide identification, it is still not straightforward to generate a reliable list of proteins. Fig. 2 shows that one can model the relationship between the identified peptides and their candidate proteins in the database as a bipartite graph. Each vertex Ti represents an identified peptide. The vertex Rj is a candidate protein in the database that matches at least one identified peptide Ti. This is the standard input setting for most protein inference algorithms.

The general problem of inferring correct proteins from this bipartite graph is difficult to solve owing to the existence of degenerate peptides and "one-hit wonders". Degenerate peptides, also called shared peptides, are identified peptides shared by multiple proteins in the database. Such cases often result from the existence of homologous proteins in the database and make it difficult to distinguish between two possibilities: (1) all of the related proteins are present in the

Fig. 1. A general pipeline for protein identification in shotgun proteomics.

sample; (2) only some of the proteins are truly present. For example, R1 and R2 in Fig. 2 have the same set of identified peptides, T1 and T2, which are degenerate peptides. Without other supporting information such as the peptide T3, we cannot determine which protein is actually present in the sample. "One-hit wonders" are proteins that have only a single identified peptide. Theoretically, it is difficult to infer proteins from a single peptide identification owing to the possibility of a false positive identification. For example, the protein R3 in Fig. 2 is a one-hit wonder with only one matching peptide, T4. Compared with R4, which has two identified peptides, it is not easy to determine the existence of R3.

During the past decade, researchers have developed many protein inference algorithms to tackle the above challenges (Tabb et al., 2002; Yang et al., 2004; Zhang et al., 2007; Ma et al., 2009; Slotta et al., 2010; Alves et al., 2007; Nesvizhskii et al., 2003; Price et al., 2007; Feng et al., 2007; Weatherly et al., 2005; Li et al., 2009b, 2010, 2009a; Serang et al., 2010; Moore et al., 2002; Sadygov et al., 2004; Bern and Goldberg, 2008; Shen et al., 2008; Spivak et al., 2012; Searle, 2010; He et al., 2011; Lu et al., 2008; Kearney et al., 2008; Ramakrishnan et al., 2009a,b; Grobei et al., 2009; Qeli and Ahrens, 2010; Gerster et al., 2010). Detailed descriptions and a summary of these methods can be found in recent review papers (Huang et al., 2012; Li and Radivojac, 2012). Briefly, all these efforts aim at solving the peptide degeneracy problem and the one-hit wonder problem so as to achieve better identification accuracy. Though they share the same objective, different algorithms model the protein inference problem in quite different ways. On one hand, methods such as ProteinProphet (Nesvizhskii et al., 2003)

and IDPicker (Zhang et al., 2007; Ma et al., 2009) use the standard bipartite graph as input, with an emphasis on developing rigorous models and algorithms under different assumptions. However, it is still very difficult to distinguish proteins that share exactly the same set of peptides when there is no other supporting information. On the other hand, researchers have begun to realize that employing supplementary information such as gene models (Gerster et al., 2010), protein–protein interaction networks (Li et al., 2009a) and peptide detectability (Alves et al., 2007) is more promising for completely resolving the peptide degeneracy problem and the one-hit wonder issue. Among the different types of supplementary information, peptide detectability is an intrinsic property of the peptide that can, in principle, explain the assignment of degenerate peptides. Therefore, the objective of this paper is to develop effective protein inference algorithms founded on the concept of peptide detectability.

Peptide detectability is defined as the probability of detecting a peptide in a standard sample by a standard proteomics routine if its parent protein is expressed (Tang et al., 2006). It is an intrinsic property of the peptide that is mainly determined by its primary sequence and the sequence of its parent protein. The concept implies that not only the identified peptides but also the missed (unidentified) peptides are important for protein inference. Based on the concept of peptide detectability, Li et al. (2009b) proposed the MSBayesPro algorithm, which places particular emphasis on the peptide degeneracy problem. Since MSBayesPro models the protein inference problem as a combinatorial optimization problem, it is almost infeasible to obtain the optimal protein

Fig. 2. The bipartite graph represents the input of a standard protein inference problem. We take only the best spectrum match for each peptide.


probabilities within a reasonable amount of time. That means there is no guarantee of finding the best protein inference results under its optimization model. More importantly, MSBayesPro is time-consuming when the datasets are relatively large.

In this paper, we formulate the protein inference problem as a constrained Lasso (Tibshirani, 1996) regression problem and then solve it with a fast pathwise coordinate descent algorithm (Friedman et al., 2007, 2010). This new protein inference algorithm is named ProteinLasso. Lasso is a shrinkage least squares method in statistical learning that adds an L1 norm penalty term to the least squares objective function so as to achieve sparsity. More precisely, we first show that the probability of each identified peptide can be expressed as a linear combination of protein probabilities, where the coefficients are the conditional peptide probabilities given the proteins (i.e., peptide detectabilities). If we take the protein probabilities as unknown variables and assume that the peptide probabilities and peptide detectabilities are known in advance, a very natural idea is to formulate the protein inference problem as a constrained least squares regression problem. Here we pose additional constraints, since the probability of each protein should fall into [0, 1]. Next, we introduce an L1 norm penalty into the model to enforce sparsity, since we are interested in finding the subset of proteins that are truly present in the sample. This modification leads to a constrained Lasso regression problem, which can be solved very efficiently through a coordinate descent algorithm in a pathwise manner. Although the pathwise coordinate descent algorithm can quickly solve the constrained Lasso problem for a set of different sparsity penalty parameters, we still need to choose a proper value from this candidate set for model selection. Instead of using the Bayesian information criterion (BIC) or cross validation to choose a single parameter, we adopt an ensemble learning strategy in which each protein's probability is calculated as the average of its probabilities over all models under the different sparsity penalty parameters. This ensemble learning idea addresses the parameter selection problem. Experimental results on three datasets show that our ProteinLasso algorithm outperforms state-of-the-art protein inference algorithms in terms of both identification accuracy and running efficiency.

The rest of this paper is organized as follows. In Section 2, we describe our method in detail. Section 3 presents the experimental results and Section 4 concludes the paper.

2. Model and algorithm

This section formulates the protein inference problem as a constrained Lasso regression problem and then shows that it can be solved efficiently via the coordinate descent algorithm.

2.1. Model

Suppose the bipartite graph contains n identified peptides (T1, T2, ..., Tn) and m candidate proteins (R1, R2, ..., Rm) with at least one peptide hit. Let Pr(Ti = 1) denote the probability that the peptide Ti is present and V = j denote the event that only the protein Rj is present in the sample, so that Pr(V = j) = Pr(R1 = 0, ..., Rj = 1, ..., Rm = 0). According to the sum rule of probability theory, we have:

Pr(Ti = 1) = Σ_{j=1}^{m} Pr(Ti = 1, V = j) = Σ_{j=1}^{m} Pr(Ti = 1 | V = j) Pr(V = j),   (1)

where Pr(Ti = 1 | V = j) denotes the conditional probability that the peptide Ti is present in the sample given only the protein Rj.

The assumption behind Eq. (1) is that only one protein is present in the sample, i.e. Σ_{j=1}^{m} Pr(V = j) = 1. This assumption is not as strong as it seems, since the input bipartite graph for protein inference can be decomposed into many smaller unconnected bipartite graphs and the majority of them contain only one protein. As shown in the experimental section, nearly 90% of these unconnected bipartite graphs have only one protein across all three datasets used in this paper. Based on this fact, we can approximate the protein probability Pr(Rj = 1) with Pr(V = j). Then, Eq. (1) becomes

Pr(Ti = 1) = Σ_{j=1}^{m} Pr(Ti = 1 | V = j) Pr(Rj = 1).   (2)

In practice, we do not know the true probability values in Eq. (2) and have to estimate them from the data. The estimated value yi for the peptide probability Pr(Ti = 1) is provided by database search algorithms such as Mascot (Perkins et al., 1999) or post-processing tools such as PeptideProphet (Keller et al., 2002). The conditional probability Pr(Ti = 1 | V = j) can be approximated by the peptide detectability value dij, which is an intrinsic property of the peptide and can be predicted using models built on training data (Tang et al., 2006). Our task in protein inference is then to find a good estimate xj of the protein probability Pr(Rj = 1). A natural idea is to formulate this inference problem as a least squares regression problem with some additional constraints:

min_X f(X) = min_X Σ_{i=1}^{n} (yi − Σ_{j=1}^{m} dij xj)^2   (3)

subject to 0 ≤ xj ≤ 1 for all 1 ≤ j ≤ m,

where the protein probability vector X = (x1, x2, ..., xm). The least squares formulation assumes that all proteins contribute to the peptide probabilities, i.e. the probabilities of most proteins should be non-zero. However, only a small subset of the candidate proteins is truly present in the sample, which means most protein probability values should be zero. Therefore, we add an L1 norm penalty to the least squares objective function so as to shrink some protein probabilities to 0. This modification leads to a constrained Lasso (Tibshirani, 1996) regression problem:

min_X Σ_{i=1}^{n} (yi − Σ_{j=1}^{m} dij xj)^2 + λ Σ_{j=1}^{m} xj   (4)

subject to 0 ≤ xj ≤ 1 for all 1 ≤ j ≤ m,

where λ ≥ 0. Here λ controls the number of non-zero protein probabilities.

2.2. Algorithm

The constrained Lasso regression problem is a convex optimization problem, whose optimal solution can be found with standard solvers for convex optimization. In addition, there are specialized algorithms that solve Lasso-type problems more efficiently, such as LARS (Efron et al., 2004) and the coordinate descent algorithm (Friedman et al., 2007). In this paper, we utilize the pathwise coordinate descent algorithm proposed by Friedman et al. (2007) for its efficiency and simplicity. Note that our formulation differs from the standard Lasso regression model since we have additional constraints on the range of the variables. In this section, we show that we can still use the coordinate descent algorithm to solve such a constrained Lasso regression problem with only minor modifications.
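To make the formulation concrete, the objective of Eq. (4) with its box constraint can be evaluated in a few lines of NumPy. The detectability matrix and peptide probabilities below are made-up toy values of our own, not data from the paper:

```python
import numpy as np

# Toy example (hypothetical numbers): 4 identified peptides, 3 candidate proteins.
# D[i, j] = d_ij, the detectability of peptide i given protein j
# (zero when protein j does not contain peptide i).
D = np.array([[0.9, 0.8, 0.0],
              [0.7, 0.6, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.0, 0.9]])
y = np.array([0.95, 0.80, 0.40, 0.85])  # estimated peptide probabilities

def lasso_objective(x, lam):
    """Eq. (4): squared residual plus L1 penalty. Since the box constraint
    forces x >= 0, the L1 penalty reduces to lam * sum(x)."""
    assert np.all((x >= 0) & (x <= 1)), "box constraint 0 <= x_j <= 1"
    residual = y - D @ x
    return residual @ residual + lam * x.sum()

x = np.array([0.9, 0.0, 0.9])  # one candidate protein probability vector
value = lasso_objective(x, lam=0.1)
```

Larger λ penalizes every non-zero xj, which is what drives some protein probabilities to exactly zero in the coordinate-wise solution described next.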


Suppose we currently have estimates of xk for k ≠ j and wish to partially optimize with respect to xj. The objective function in Eq. (4) can then be re-written as:

f(X) = Σ_{i=1}^{n} (yi − Σ_{k≠j} dik x̃k − dij xj)^2 + λ (Σ_{k≠j} |x̃k| + |xj|),   (5)

where all the variables xk for k ≠ j are held fixed at the values x̃k. We can differentiate Eq. (5) to get

∂f/∂xj = Σ_{i=1}^{n} (−2) dij (yi − Σ_{k≠j} dik x̃k − dij xj) + λ = 0.   (6)

Equivalently, the coordinate-wise update has the form

xj = (Σ_{i=1}^{n} dij (yi − ỹi^(j)) − λ/2) / (Σ_{i=1}^{n} dij^2),   (7)

where ỹi^(j) = Σ_{k≠j} dik x̃k. According to the constraint 0 ≤ xj ≤ 1, the updated value should be calculated as:

x̃j = { 0, if xj < 0;  1, if xj > 1;  xj, otherwise }.   (8)

The update in Eq. (8) is repeated for j = 1, 2, ..., m, 1, 2, ... until convergence. From Eqs. (7) and (8), we can easily see why the Lasso regression approach can control the number of reported proteins with non-zero probabilities: making λ sufficiently large will drive some protein probabilities to exactly zero.

Looking more closely at Eq. (7), we can see that

yi − ỹi^(j) = yi − ŷi + dij x̃j = ri + dij x̃j,   (9)

where ŷi is the calculated probability of the ith peptide based on the current estimated protein probabilities, and hence ri is the current residual between the observed and calculated peptide probabilities. Thus

Σ_{i=1}^{n} dij (yi − ỹi^(j)) = Σ_{i=1}^{n} dij ri + Σ_{i=1}^{n} dij^2 x̃j.   (10)

As shown in Friedman et al. (2010), further efficiency can be achieved in computing the updates in Eq. (10) by writing the first term on the right as

Σ_{i=1}^{n} dij ri = ⟨dj, y⟩ − Σ_{k:|x̃k|>0} ⟨dj, dk⟩ x̃k,   (11)

where ⟨dj, y⟩ = Σ_{i=1}^{n} dij yi. Substituting Eqs. (10) and (11) into Eq. (7) leads to

xj = (⟨dj, y⟩ − Σ_{k:|x̃k|>0} ⟨dj, dk⟩ x̃k + Σ_{i=1}^{n} dij^2 x̃j − λ/2) / (Σ_{i=1}^{n} dij^2).   (12)

Since the peptide detectabilities dij and the peptide probabilities yi are provided as the input of ProteinLasso, we can compute and store the inner products ⟨dj, y⟩ and ⟨dj, dk⟩ in advance. Thus, the update procedure is very efficient.

Since it is impossible to know in advance which λ value generates the best solution, the coordinate descent algorithm computes the solution on a grid of λ values (Friedman et al., 2010). It is initialized with the smallest value λmax for which the protein probability vector X = 0, and λ is then decreased gradually. Each time λ is reduced a little, and under this λ value the coordinate descent algorithm cycles through the variables until convergence. The process is repeated using the previous solution as a "warm start". Since the minimizers of many of the protein probabilities do not change as λ is updated, this pathwise coordinate descent procedure is very fast. We know from Eq. (12) that xj will be zero if ⟨dj, y⟩ < 0.5λ. Hence λmax = 2 max_j ⟨dj, y⟩. The algorithm then selects a minimum value λmin = ελmax and constructs a sequence of K values of λ decreasing from λmax to λmin on the log scale. We choose ε = 0.001 and K = 100 in our experiments.

2.3. Parameter tuning

In this section, we introduce an ensemble learning method to address the model selection problem. Rather than choosing a single λ value, we combine the strengths of a collection of models, each based on a single λ.

That is, we run the coordinate descent algorithm over the grid of λ values and combine the results to generate a final protein probability:

xj = (Σ_{s=1}^{K} xjs) / K,   (13)

where xjs is the estimated probability of protein j at the value λs. In fact, we also tried using BIC to select a single λ for protein inference. However, our experimental results show that BIC does not provide very good identification results.

3. Experimental results

To test the performance of our algorithm, we compare our method with ProteinProphet (Nesvizhskii et al., 2003) and MSBayesPro (Li et al., 2009b) on three datasets.

3.1. Datasets and experiment set-up

We use three datasets in our experiments. Each dataset has a reference set that contains the ground-truth proteins.

3.1.1. Mixture of 18 purified proteins

The first dataset is a mixture of 18 highly purified proteins from the ISB Standard Protein Mix Database (Klimek et al., 2008). Among the 27 datasets provided on the website, we choose one dataset from mix2 acquired on the LTQ instrument, because it has more unique peptides and higher protein coverage. We use a database containing 1819 protein sequences for peptide identification. Both the data and the database are available at http://regis-web.systemsbiology.net/PublicDatasets/.

3.1.2. Sigma49 dataset

Sigma49 is a mixture of 49 human proteins. The database used for peptide identification is composed of 15,682 protein sequences. The data and database are available at https://proteomecommons.org/dataset.jsp?i=71610.

3.1.3. Yeast dataset

This dataset was used by Ramakrishnan et al. (2009a). The raw data is available at http://www.marcottelab.org/users/MSdata/Data_02/. The reference set is generated by an intersection of 4 MS-based proteomics datasets and 3 non-MS-based datasets. It contains 4265 proteins observed in either two or more MS datasets or any of the non-MS datasets, and is available at http://www.marcottelab.org/MSdata/gold_yeast.html. The database used in the experiment contains 6714 protein sequences and is available at the same website as the raw data.


All tandem mass spectra are searched against the respective protein sequence database with the X!Tandem (v2010.10.01.1) search engine. We use default search parameters wherever possible, assuming these parameters have already been optimized. Some important parameter specifications are listed below:

• Fragment monoisotopic mass error = 0.4 Da;
• Parent monoisotopic mass error = 100 ppm;
• Minimum peaks = 15;
• Minimum fragment m/z = 150.

The peptide probability is calculated with PeptideProphet embedded in the Trans-Proteomic Pipeline (TPP) v4.5. Peptides with a PeptideProphet probability >0.05 are considered candidate peptides, and proteins with at least one peptide hit are used as candidate proteins.

We compare ProteinLasso with ProteinProphet and MSBayesPro. ProteinProphet is the most popular method for protein inference so far. We include MSBayesPro in the performance comparison since it also uses peptide detectability as input.

3.1.4. ProteinProphet

ProteinProphet puts proteins that cannot be distinguished with respect to the identified peptides into the same group. In the performance evaluation, we consider all proteins in each group and use the group probability as the probability of each protein in that group.

3.1.5. MSBayesPro

We run MSBayesPro according to the procedure below:

1. Obtain the peptide identifications and their probabilities from the PeptideProphet files. If there is more than one spectrum match for a peptide, we choose the highest value as the peptide probability.
2. Obtain the predicted peptide detectabilities from http://darwin.informatics.indiana.edu/applications/PeptideDetectabilityPredictor/. This software currently only predicts scores for tryptic peptides. However, some identified peptides are non-tryptic, so we have to assign detectability scores to these peptides ourselves. The detectability value used for these peptides is: median(predicted detectability scores from the same parent protein)/3.
3. Run MSBayesPro for the first time to estimate the protein priors.
4. Run MSBayesPro for the second time to obtain the final protein probabilities, with the priors from the first run as input.

3.1.6. ProteinLasso

Like MSBayesPro, ProteinLasso needs peptide detectability values as input. We adopt the same peptide detectability generation procedure used for MSBayesPro. We set ε = 0.001 and K = 100 in our experiment.
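The pathwise procedure of Sections 2.2 and 2.3 can be sketched as follows. This is our own illustrative implementation, not the released ProteinLasso code: it combines the coordinate-wise update of Eq. (7), the clipping of Eq. (8), a log-spaced λ grid from λmax down to ελmax, and the ensemble average of Eq. (13), using a fixed inner iteration count where the paper iterates to convergence:

```python
import numpy as np

def protein_lasso(D, y, eps=0.001, K=100, n_iter=200):
    """Sketch of ProteinLasso: pathwise coordinate descent with box
    constraints, followed by ensemble averaging over the lambda grid.
    D is the n x m peptide detectability matrix, y the peptide probabilities."""
    n, m = D.shape
    col_sq = (D ** 2).sum(axis=0)                  # sum_i d_ij^2 for each j
    # lambda_max = 2 * max_j <d_j, y> (floored to avoid log10 of zero)
    lam_max = 2.0 * max((D * y[:, None]).sum(axis=0).max(), 1e-12)
    lams = np.logspace(np.log10(lam_max), np.log10(eps * lam_max), K)
    x = np.zeros(m)                                # warm start across lambdas
    solutions = []
    for lam in lams:
        for _ in range(n_iter):                    # fixed passes; a convergence
            for j in range(m):                     # check could replace this
                if col_sq[j] == 0:
                    continue
                # partial residual excluding protein j, as in Eq. (7)
                r_j = y - D @ x + D[:, j] * x[j]
                xj = (D[:, j] @ r_j - lam / 2.0) / col_sq[j]
                x[j] = min(1.0, max(0.0, xj))      # clipping of Eq. (8)
        solutions.append(x.copy())
    return np.mean(solutions, axis=0)              # ensemble average, Eq. (13)
```

A production implementation would precompute the inner products ⟨dj, y⟩ and ⟨dj, dk⟩ as in Eq. (12) rather than recomputing residuals, which is exactly the efficiency trick the paper borrows from Friedman et al. (2010).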
Given the peptide identification and peptide detectability files, we can calculate the λmax value. Then λmin = ελmax. We choose K values of λ decreasing from λmax to λmin on the log scale. For each λ value, ProteinLasso outputs a list of protein probabilities, and these lists are then combined to generate the final result.

3.2. Results

Since all the datasets in our experiment have reference sets, we label a protein as a true positive if it is present in the corresponding reference set, and as a false positive otherwise. We use the receiver operating characteristic (ROC) curve to compare the different algorithms, which plots the true positive rate (TPR) as a function of the false positive rate (FPR). The FPR at a probability threshold

t is calculated as FPRt = Ft/Ftotal, where Ft is the number of false positives with probability ≥ t and Ftotal is the total number of false positives. Likewise, the TPR at a threshold t is the fraction of true positives that have probability ≥ t. We also calculate the area under the ROC curve (AUC) to measure the overall performance of each algorithm.

Fig. 3 shows the identification performance of the different methods evaluated through ROC curves. The results show that ProteinLasso outperforms ProteinProphet and MSBayesPro consistently on all three datasets. In particular, the performance of ProteinLasso is much better than that of ProteinProphet and MSBayesPro on the 18 mixtures dataset. One may argue that ProteinLasso achieves only a minor performance improvement over ProteinProphet and MSBayesPro in terms of the AUC value on the other two datasets. Indeed, the AUC value measures performance over a broad range of probability thresholds. As a result, it is a good indicator of overall performance but may not fully describe the true performance in a high-precision regime. In the evaluation of peptide/protein identification algorithms, researchers are more interested in reporting more true positives at a low false discovery rate (FDR) or q-value. Therefore, we also plot the number of true positives as a function of q-value in Fig. 4 to examine the performance gap between the algorithms.1 As shown in Fig. 4, ProteinLasso always outperforms ProteinProphet and MSBayesPro when the q-value is small. This indicates that our method also achieves very good performance in a high-precision regime. In particular, ProteinLasso allows for protein distinction at a fine level, whereas ProteinProphet and MSBayesPro often assign the maximal score of one to many proteins. As a result, ProteinLasso exhibits zero false positives among its H top-scoring proteins, where H is 14, 6 and 401 on the 18 mixtures, Sigma49 and yeast datasets, respectively.
In contrast, ProteinProphet and MSBayesPro cannot produce zero false positives since the probabilities of their top-scoring proteins are all equal to 1. The good performance of ProteinLasso should be attributed to its fine-grained ranking of the identified proteins.

As discussed in Section 2, the assumption behind our model is that only one protein is present in the sample. This assumption is not as strong as it seems, since the input bipartite graph can be decomposed into many smaller connected components. For all three datasets used in our experiment, most of these components have only one protein, as shown in Fig. 5. This indicates that our assumption fits the real data well.

Degenerate peptides are one of the major difficulties in the protein inference problem. A connected component contains no degenerate peptides if it has only one protein. Accordingly, we divide the proteins into two classes: "degenerate proteins", which belong to components having more than one protein, and "simple proteins", which belong to components having only one protein. To compare the ability of the different methods to handle the peptide degeneracy issue, we present the identification results of the three methods when inferring "degenerate proteins" in Table 1. For each dataset, we count the number of true positive and false positive proteins identified by ProteinProphet, MSBayesPro and ProteinLasso with the same number of reported proteins. Then, we count the number of "degenerate proteins" and "simple proteins" among these true positives and false positives, respectively.

From Table 1, we have the following observations. On one hand, MSBayesPro always reports the fewest false positive degenerate proteins across all three datasets, at the cost of identifying fewer true positive degenerate proteins. This means that

1 Given a certain probability threshold t, suppose there are Tt true positives and Ft false positives; the FDR is estimated as FDRt = Ft/(Ft + Tt). The corresponding q-value is the minimal FDR at which a protein is reported: qt = min_{t′≤t} FDRt′.
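The evaluation measures defined above (FPR, TPR, FDR and the q-value) can be computed directly from a list of protein probabilities and reference-set labels. The sketch below is our own illustration of those definitions, not code from the paper:

```python
import numpy as np

def roc_and_qvalues(probs, is_true):
    """probs: protein probabilities; is_true: True if the protein is in the
    reference set. Returns (fpr, tpr, qvalues) at each threshold position,
    scanning proteins from the highest probability downward."""
    order = np.argsort(-np.asarray(probs))
    labels = np.asarray(is_true)[order]
    tp = np.cumsum(labels)             # T_t: true positives with prob >= t
    fp = np.cumsum(~labels)            # F_t: false positives with prob >= t
    tpr = tp / max(labels.sum(), 1)    # fraction of all true positives found
    fpr = fp / max((~labels).sum(), 1) # fraction of all false positives found
    fdr = fp / (fp + tp)               # FDR_t = F_t / (F_t + T_t)
    # q-value: minimal FDR over all (more lenient) thresholds that still
    # report the protein, i.e. a running minimum taken from the lenient end
    qvalues = np.minimum.accumulate(fdr[::-1])[::-1]
    return fpr, tpr, qvalues
```

The AUC reported in the paper is then just the area under the resulting (fpr, tpr) curve, e.g. via the trapezoidal rule.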


Fig. 3. Identification performance comparison among ProteinProphet, MSBayesPro and ProteinLasso on three datasets using ROC curves. The probabilities of the top-scoring proteins in ProteinProphet and MSBayesPro are all equal to one, and the order of these proteins in the output is random. Thus, when calculating the FPR we skip these proteins with the same probabilities and start from the first protein with a different probability value.

MSBayesPro is more powerful at controlling the false discovery rate with respect to degenerate proteins. On the other hand, ProteinProphet is able to identify more true positive degenerate proteins than MSBayesPro and ProteinLasso, but it also reports more false positive degenerate proteins. In contrast to ProteinProphet and MSBayesPro, our method offers a reasonable trade-off between the true positive and false positive rates.

ProteinLasso requires two parameters: K and ε. We empirically choose ε = 0.001 and K = 100. In fact, a different choice of K and ε has very limited effect on the performance of ProteinLasso. We show this by running ProteinLasso over two rough grids of K and ε: K ranges from 10 to 100 with ε fixed at 0.001, and ε ranges from 0.001 to 0.01 with K fixed at 100. For each pair of parameters, we compute the corresponding AUC value on each dataset. Fig. 6 demonstrates that the AUC values fluctuate only slightly, so ProteinLasso is robust to changes in its parameters.

Table 2 presents the running time of the three protein inference methods on the three datasets. ProteinLasso runs much faster than the

Table 1
Accuracy on proteins containing degenerate peptides. For the three datasets, we count the number of true positives (TP) and false positives (FP) identified by ProteinProphet, MSBayesPro and ProteinLasso among their top-k ranked proteins, where k is 31, 43 and 538 for the 18 mixtures, Sigma49 and yeast datasets, respectively. The value of k is determined by the number of proteins with probability 1.0 reported by ProteinProphet. We divide the identified proteins into two classes: "degenerate proteins" belong to components that have more than one protein, while "simple proteins" do not share a peptide with any other protein and thus belong to components containing only one protein. Methods are abbreviated as PP = ProteinProphet, MSB = MSBayesPro, and PL = ProteinLasso.

                      18 mixtures               Sigma49                   Yeast
                      PP      MSB     PL        PP      MSB     PL        PP        MSB      PL
                      TP  FP  TP  FP  TP  FP    TP  FP  TP  FP  TP  FP    TP   FP   TP   FP  TP   FP
Simple proteins       17  8   17  11  17  10    27  1   32  6   30  3     354  2    461  7   427  3
Degenerate proteins   1   5   0   3   0   4     5   10  4   1   6   4     176  6    70   0   108  0


Fig. 4. Identification performance comparison among ProteinProphet, MSBayesPro and ProteinLasso in a high-precision regime. In general, researchers are more interested in the performance of different algorithms when the q-value or FDR is very small, e.g., ≤0.05. However, since the probabilities of the top-scoring proteins in ProteinProphet and MSBayesPro are all equal to one, we have to skip these proteins with identical probabilities and calculate the q-value at the first protein with a different probability. As a result, there are no points in the range [0, 0.05] for the curves of ProteinProphet and MSBayesPro on the 18 mixtures and Sigma49 datasets, and hence we cannot set the q-value range to [0, 0.05] for these two datasets.
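The q-value curves of Fig. 4 can be computed roughly as follows, including the tie handling described in the caption: a point is emitted only at the last protein of each run of equal probabilities, and FDRs are converted to q-values by taking the minimum FDR over that threshold and all looser ones. Labelling true positives from a known reference set is an assumption of this sketch:

```python
def qvalue_points(probs, labels):
    """probs sorted in descending order; labels[i] is True for a protein in
    the reference set. Returns (q_value, n_true_positives) points, one per
    distinct probability threshold."""
    fdrs, fp = [], 0
    for i, (p, is_tp) in enumerate(zip(probs, labels)):
        fp += not is_tp
        # emit a point only where the next probability differs,
        # i.e., at the end of a run of tied probabilities
        if i + 1 == len(probs) or probs[i + 1] != p:
            fdrs.append((fp / (i + 1), i + 1 - fp))
    # q-value = minimum FDR achievable at this threshold or any looser one
    points, best = [], float("inf")
    for fdr, tp in reversed(fdrs):
        best = min(best, fdr)
        points.append((best, tp))
    return points[::-1]
```

With this scheme, a block of top-ranked proteins that all carry probability 1.0 produces no point until the first probability change, which is exactly why the ProteinProphet and MSBayesPro curves can start above q = 0.05.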

Fig. 5. The distribution of components with respect to the number of proteins in each component. The y-axis stands for the percentage of components that contain a certain number of proteins. Most of the connected components have only one protein. In particular, there are no components containing more than three proteins in 18 mixtures and Sigma49 datasets.
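The parameter study summarized in Fig. 6 amounts to re-running the inference over a small grid of (K, ε) values and recording the AUC each time. In this sketch, `run_protein_lasso` is a hypothetical stand-in for the actual solver, and AUC is computed via the rank-sum (Mann-Whitney) identity:

```python
def auc_from_scores(scores, labels):
    """Area under the ROC curve from (score, is_true_protein) pairs using
    the rank-sum identity; tied scores count as half a win."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def robustness_grid(run_protein_lasso, labels, ks, epsilons):
    """AUC for every (K, epsilon) pair; a small spread across the grid
    indicates robustness to the parameter choice."""
    return {(k, eps): auc_from_scores(run_protein_lasso(K=k, eps=eps), labels)
            for k in ks for eps in epsilons}
```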


other two methods. The running time for MSBayesPro measures only the time of step 3, i.e., running MSBayesPro for the first time to estimate the protein priors. The efficiency of ProteinLasso is due to the use of the coordinate descent algorithm in problem solving. All methods were tested on a DELL T5500 workstation with a single 2.26 GHz CPU.

Fig. 6. Effect of parameters K and ε on ProteinLasso.

Table 2
The running time for three methods.

Datasets         ProteinLasso    ProteinProphet    MSBayesPro
18 mixtures      0.567 s         7 s               28 s
Sigma49          0.555 s         1 s               5 s
Yeast dataset    27 s            56 s              443 s

4. Conclusions

In this paper, we propose a new algorithm, ProteinLasso, for protein inference in shotgun proteomics. The new algorithm has several advantages over the existing methods: (1) it casts the protein inference problem as a convex optimization problem, so we can always find the optimal solution; (2) it is extremely fast, so we can use it to handle large-scale proteomics data; (3) it is robust to parameter specification, so we can use it with ease.

Acknowledgements

We thank Dr. Haixu Tang and Dr. Yong Fuga Li for their help on peptide detectability. We thank Dr. Smriti R. Ramakrishnan for providing us with the protein reference set of the yeast dataset used in the experiments. This work was partially supported by the Natural Science Foundation of China under Grant no. 61003176.
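The coordinate descent idea credited above for ProteinLasso's speed can be illustrated on a plain, unconstrained Lasso (minimizing ½‖y − Xβ‖² + λ‖β‖₁); the paper's constrained formulation adds steps not shown here, so this is a generic sketch rather than the authors' implementation. Each coordinate is updated in closed form by soft-thresholding, and a running residual keeps every update linear in the number of observations:

```python
def soft_threshold(z, t):
    """Closed-form solution of the one-dimensional Lasso subproblem."""
    return (z - t) if z > t else (z + t) if z < -t else 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for 0.5*||y - X b||^2 + lam*||b||_1.
    X is a list of rows; returns the coefficient vector b."""
    n, d = len(X), len(X[0])
    b = [0.0] * d
    resid = [y[i] - sum(X[i][j] * b[j] for j in range(d)) for i in range(n)]
    for _ in range(n_iter):
        for j in range(d):
            # correlation of column j with the partial residual
            rho = sum(X[i][j] * (resid[i] + X[i][j] * b[j]) for i in range(n))
            norm = sum(X[i][j] ** 2 for i in range(n))
            new_bj = soft_threshold(rho, lam) / norm
            delta = new_bj - b[j]
            if delta:
                for i in range(n):   # keep the residual consistent with b
                    resid[i] -= X[i][j] * delta
                b[j] = new_bj
    return b
```

Because each update touches only one coordinate and reuses the residual, a full sweep costs O(nd), which is what makes coordinate descent attractive for the large, sparse design matrices arising in protein inference.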


