IPM: An integrated protein model for false discovery rate estimation and identification in high-throughput proteomics

Journal of Proteomics 75 (2011) 116–121

available at www.sciencedirect.com

www.elsevier.com/locate/jprot

Roger Higdon (a, b, c, ⁎), Lukas Reiter (d, e, f, g), Gregory Hather (a, b), Winston Haynes (a, h), Natali Kolker (b, c), Elizabeth Stewart (a, b), Andrew T. Bauman (a, b), Paola Picotti (g), Alexander Schmidt (g, i), Gerald van Belle (k, l), Ruedi Aebersold (g, j, m, n), Eugene Kolker (a, b, c, o)

a Bioinformatics & High-throughput Analysis Laboratory, Seattle, WA, USA
b High-throughput Analysis Core, Seattle Children's Research Institute, Seattle, WA, USA
c Predictive Analytics, Seattle Children's Hospital, Seattle, WA, USA
d Institute of Molecular Biology, University of Zurich, Zurich, Switzerland
e Center for Model Organism Proteomes, University of Zurich, Zurich, Switzerland
f Ph.D. Program in Molecular Life Sciences Zurich, University of Zurich and ETH Zurich, Zurich, Switzerland
g Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland
h Hendrix College, Conway, AR, USA
i Biozentrum, University of Basel, Basel, Switzerland
j Competence Center for Systems Physiology and Metabolic Diseases, Zurich, Switzerland
k Department of Biostatistics, University of Washington, Seattle, WA, USA
l Department of Environmental and Occupational Health Sciences, University of Washington, Seattle, WA, USA
m Faculty of Science, University of Zurich, Zurich, Switzerland
n Institute for Systems Biology, Seattle, WA, USA
o Medical Education and Biomedical Informatics, University of Washington, Seattle, WA, USA

ARTICLE INFO

Available online 21 June 2011

Keywords: Protein identification; False discovery rate; Mass spectrometry; Decoy database

ABSTRACT

In high-throughput mass spectrometry proteomics, peptides and proteins are not simply identified as present or not present in a sample, rather the identifications are associated with differing levels of confidence. The false discovery rate (FDR) has emerged as an accepted means for measuring the confidence associated with identifications. We have developed the Systematic Protein Investigative Research Environment (SPIRE) for the purpose of integrating the best available proteomics methods. Two successful approaches to estimating the FDR for MS protein identifications are the MAYU and our current SPIRE methods. We present here a method to combine these two approaches to estimating the FDR for MS protein identifications into an integrated protein model (IPM). We illustrate the high quality performance of this IPM approach through testing on two large publicly available proteomics datasets. MAYU and SPIRE show remarkable consistency in identifying proteins in these datasets. Still, IPM results in a more robust FDR estimation approach and additional identifications, particularly among low abundance proteins. IPM is now implemented as a part of the SPIRE system. © 2011 Published by Elsevier B.V.

Abbreviations: FDR, false discovery rate; FP, false positive; ID, identification; IPM, integrated protein model; LIPS, logistic identification of peptide sequences; MS, mass spectrometry; PSM, peptide spectral match; SPIRE, Systematic Protein Investigative Research Environment; TP, true positive.

⁎ Corresponding author at: SCRI, 1900 Ninth Ave, Seattle, WA 98101, USA. Tel.: +1 206 884 7172; fax: +1 206 987 7660. E-mail address: [email protected] (R. Higdon).

1874-3919/$ – see front matter © 2011 Published by Elsevier B.V. doi:10.1016/j.jprot.2011.06.003

1. Introduction

Proteomics is being accelerated by high-throughput analysis of protein samples using mass spectrometry (MS) [1,2]. Prior to analysis by MS, proteins are typically digested into their peptide components. Search engines such as Sequest, Mascot, X!Tandem, and OMSSA match the spectra generated by tandem mass spectrometry (MS/MS) with peptides from a target protein sequence database [3–6]. Due to the highly complex nature of protein samples and their processing, as well as of MS instrumentation, approaches and analysis, peptide spectral matches (PSMs) are associated with varying degrees of uncertainty [7–9]. These uncertainties correspond to the similarities between the MS/MS spectra of the actual peptides and candidate spectra contained in the target database. This is important because many PSMs will have too little similarity to properly infer a relation between the spectrum and the peptide. For example, only one out of three PSMs yielded high confidence identifications (IDs) in one recent study [10].

To delineate between strong and weak PSMs, search engines typically generate a score as a relative measure of confidence in an individual spectral match. A standard metric with a more obvious external meaning for estimating the confidence (or certainty) of IDs is the false discovery rate (FDR) [11]. The FDR is the expected proportion of reported identifications that are false positives (FPs, IDs that are reported as true despite actually being false). For example, a 1% FDR indicates that out of 100 IDs, one is expected to be a FP. FDRs for PSMs are often generated using data from a target-decoy database search, where the standard spectral search is performed against a database containing candidate protein sequences (target) and a set of unrelated protein sequences (decoy) [12–18].
The decoy sequences are typically either taken from a pool of unrelated sequences from other organisms or generated by reordering the target protein sequences, either by reversing or by randomly reshuffling them. From this target-decoy search, FDRs can be estimated for PSMs based on the distribution of PSM matches to decoy sequences. An alternative to the target-decoy approach for estimating the FDR for PSMs is the use of mixture models [8]. This approach has the advantage of not requiring a decoy database, but has been shown in certain circumstances to generate poor estimates of the FDR [14,16].

As intact proteins are not identified by MS, spectra cannot be directly associated with a protein. Therefore, combining PSMs into confidence scores for protein IDs is important for determining statistical and biological significance. Unfortunately, the transition from PSM to protein ID raises difficulties due to the differing numbers of PSMs for each protein, the varying confidence in each PSM, and the unique properties of each protein [7–9]. Provided that a peptide spectral search has been done against a target-decoy database, various techniques are available to estimate the FDR for protein matches by considering the decoy matches as known FPs.

Recent calls by organizations such as the Human Proteome Organization (www.hupo.org) have made it clear that new high-throughput discovery analysis methods are badly needed. This is also evidenced by recent discussions among experts and researchers in proteomics indicating that reliable estimation of the protein FDRs and the evaluation and integration of the best methods are crucial to the success of proteomics [10]. We have developed the Systematic Protein Investigative Research Environment (SPIRE) with these objectives in mind [11]. As an example of this approach, this paper focuses on comparing two approaches to generating protein FDRs from target-decoy database searches, MAYU and our current SPIRE method, and combining them into one integrated protein model (IPM).
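The PSM-level target-decoy estimate described earlier in this section can be illustrated with a minimal sketch. It assumes the simplest common estimator (decoy matches divided by target matches above a score threshold) and a decoy database of the same size as the target database; the function name and the input layout are hypothetical, and this is not necessarily the estimator used inside SPIRE or MAYU.

```python
def psm_fdr_at_threshold(psms, threshold):
    """Estimate the PSM-level FDR at a score threshold.

    psms: iterable of (score, is_decoy) pairs, where a higher score
          means a better match (an assumption of this sketch).
    Assumes target and decoy databases of equal size, so each decoy
    match above the threshold stands in for one false target match.
    """
    psms = list(psms)  # allow generators to be scanned twice
    targets = sum(1 for score, is_decoy in psms
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0  # nothing accepted, so no false discoveries to report
    return decoys / targets


if __name__ == "__main__":
    # Hypothetical data: 100 target and 1 decoy PSM pass the threshold.
    example = [(5.0, False)] * 100 + [(5.0, True)] + [(1.0, True)] * 50
    print(psm_fdr_at_threshold(example, threshold=3.0))
```

With 100 accepted target PSMs and one accepted decoy PSM, the estimate corresponds to the "one FP in 100 IDs" reading of a 1% FDR given above.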

2. Materials and methods

2.1. MAYU approach

The MAYU approach estimates protein FDRs by adjusting the target-decoy approach. It primarily focuses on large datasets and relies on two key assumptions [19]:

1. FP PSMs will uniformly distribute across the target and decoy protein databases; and
2. The event that a given target protein is identified with at least one TP PSM is independent of the event that the protein is identified with at least one FP PSM.

As a result of assumption 1, MS experiments that achieve higher coverage and generate more PSMs will have more FP PSMs that map to TP proteins. Of particular interest are cases where FP PSMs are associated with target proteins. This can arise in two ways: (a) only FP PSMs map to a protein, or (b) a combination of FP and TP PSMs map to the same protein. From assumptions 1 and 2, Reiter et al. [19] asserted that a protein should not immediately be considered an FP ID in cases where FP PSMs map to that protein. As a result, Reiter et al. [19] derived a hypergeometric model for the number of FP protein IDs, from which the FDR estimate is obtained. The mean of the hypergeometric model is used to generate an estimate of the number of FP protein IDs (fp) as a function of the number of target proteins identified (t), the number of decoy proteins identified (d), and the sizes of the target (T) and decoy (D) databases. This results in the following estimate for the number of false positive protein IDs:

$$fp = \frac{d\,(T - t)}{D - d} \tag{1}$$

and thus the FDR can be estimated by

$$FDR = \frac{fp}{t} \tag{2}$$

The numbers of protein IDs, d and t, depend on the threshold used to accept PSMs; varying the threshold therefore alters both the FDR and the number of protein IDs. The MAYU approach is described in detail in [19].
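As a minimal sketch of Eqs. (1) and (2), the two functions below compute the estimated number of FP protein IDs and the resulting protein FDR. The function names and the numbers in the usage example are hypothetical and chosen only for illustration; the full MAYU procedure is described in [19].

```python
def mayu_fp_estimate(t, d, T, D):
    """Eq. (1): estimated number of false positive target protein IDs.

    t: target proteins identified, d: decoy proteins identified,
    T: size of the target database, D: size of the decoy database.
    """
    if d >= D:
        raise ValueError("d must be smaller than the decoy database size D")
    return d * (T - t) / (D - d)


def mayu_fdr(t, d, T, D):
    """Eq. (2): protein FDR estimate, fp / t."""
    if t == 0:
        return 0.0  # no target IDs accepted at this threshold
    return mayu_fp_estimate(t, d, T, D) / t


if __name__ == "__main__":
    # Hypothetical counts, not taken from the datasets in this paper.
    t, d, T, D = 2000, 30, 6000, 6000
    print("adjusted FDR:", mayu_fdr(t, d, T, D))
    print("unadjusted d / t:", d / t)
```

In this hypothetical example the adjusted estimate is lower than the plain ratio d / t, because only the T − t target proteins not yet identified can still become new FP IDs. This is why the adjustment matters most at high proteome coverage.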

2.2. SPIRE approach

The SPIRE approach combines data from PSMs and proteins to generate a protein ID score using logistic regression models that extend our previous protein scoring model [20]. The inputs to the model are based on ID probabilities for PSMs. First, the scores for PSMs are generated by the LIPS (logistic identification of peptide sequences) model [21]. Second, the probabilities are estimated from an isotonic regression model that utilizes scoring data from decoy database PSMs [22]. Third, a logistic regression model predicts whether the protein sequence is from the target or decoy database based on the following six predictors: (1) protein length, (2) number of unique PSMs by sequence (with at least 90% ID probability), (3) total PSMs (with at least 90% ID probability), (4) maximum PSM probability, (5) sequence length of the peptide with the maximum probability, and (6) the sum of log(1 − p), where p is the PSM ID probability. The predictors are added stepwise based upon maximizing the Bayesian Information Criterion [23]. A five-fold cross-validation procedure is used to remove potential bias from over-fitting. The FDR estimate is based upon the number of decoy database matches at a given threshold [24].
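To make the six predictors concrete, the following sketch assembles them for a single protein from its PSM probabilities. It is an illustration only: the function name, the dictionary keys and the input layout are hypothetical, and it assumes that predictor (6) is summed over all PSMs of the protein, which the text does not specify. The logistic regression fit itself is not shown.

```python
import math


def protein_predictors(protein_length, psms, min_prob=0.9):
    """Build the six protein-level predictors listed above.

    psms: list of (peptide_sequence, id_probability) pairs for one protein.
    min_prob: ID probability cutoff for predictors (2) and (3).
    """
    if not psms:
        raise ValueError("at least one PSM is required")
    confident = [(seq, p) for seq, p in psms if p >= min_prob]
    best_seq, best_p = max(psms, key=lambda sp: sp[1])
    eps = 1e-12  # guards against log(0) when a probability equals 1
    return {
        "protein_length": protein_length,                     # (1)
        "unique_psms": len({seq for seq, _ in confident}),    # (2)
        "total_psms": len(confident),                         # (3)
        "max_prob": best_p,                                   # (4)
        "max_prob_peptide_length": len(best_seq),             # (5)
        "sum_log_1mp": sum(math.log(max(1.0 - p, eps))        # (6)
                           for _, p in psms),
    }


if __name__ == "__main__":
    # Hypothetical peptides and probabilities.
    example = [("PEPTIDEK", 0.99), ("PEPTIDEK", 0.95), ("AAAR", 0.50)]
    print(protein_predictors(protein_length=350, psms=example))
```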

2.3. IPM approach

Both the SPIRE protein scoring approach and the FP protein identification adjustment of MAYU provide benefits for protein identification and determination of the FDR. Combining the MAYU adjustment with the SPIRE protein scoring approach is relatively straightforward and can be achieved in multiple ways. We built the IPM approach as follows: the decoy count d (and the corresponding target count t) entering Eq. (1) of the MAYU approach was obtained using thresholds on the SPIRE protein score, rather than by varying the PSM threshold. The combination of the two approaches into IPM is therefore able to draw on the strengths of each. As a result, IPM creates a better FDR approximation and generates additional protein IDs for a given FDR.
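The combination described above can be sketched as follows: proteins are ranked by their protein-level score, the target and decoy counts t and d are accumulated at each score threshold, and Eqs. (1) and (2) are applied to those counts. This is a minimal illustration under stated assumptions, not the SPIRE implementation; the function names and the toy input are hypothetical, a higher score is assumed to be better, and tied scores are not treated specially.

```python
def ipm_fdr_curve(scored_proteins, T, D):
    """Return (threshold, t, d, fdr) for each protein-score threshold.

    scored_proteins: iterable of (protein_score, is_decoy) pairs.
    T, D: sizes of the target and decoy databases.
    """
    ranked = sorted(scored_proteins, key=lambda item: -item[0])
    t = d = 0
    curve = []
    for score, is_decoy in ranked:
        if is_decoy:
            d += 1
        else:
            t += 1
        if t == 0 or d >= D:
            continue  # Eqs. (1) and (2) are undefined here
        fp = d * (T - t) / (D - d)       # Eq. (1)
        curve.append((score, t, d, fp / t))  # Eq. (2)
    return curve


def ids_at_fdr(curve, max_fdr):
    """Largest number of target protein IDs with estimated FDR <= max_fdr."""
    return max((t for _, t, _, fdr in curve if fdr <= max_fdr), default=0)


if __name__ == "__main__":
    # Hypothetical scores for a toy database with T = D = 10.
    toy = [(0.99, False), (0.98, False), (0.97, False),
           (0.50, True), (0.40, False)]
    curve = ipm_fdr_curve(toy, T=10, D=10)
    print(ids_at_fdr(curve, max_fdr=0.01))
```

Scanning the whole curve, rather than stopping at the first threshold that exceeds the target FDR, is a design choice of this sketch: the estimated FDR need not be monotone in the threshold.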

3. Datasets and sample preparation

Datasets for this work were based on two large proteomics studies of Saccharomyces cerevisiae (yeast), which we refer to as the Aebersold and Gygi [15] studies (datasets). The Gygi dataset was generated using well-established MS proteomic methods: two-dimensional peptide separation and MS analysis using an LCQ ion trap instrument (Thermo Electron, San Jose, CA). This approach has been described in detail previously [15].

For the Aebersold dataset, S. cerevisiae cells, strain S288C, BY4741, were grown in yeast extract peptone dextrose (YEPD) liquid medium to OD600 ~2 at 30 °C. Pelleted cells were resuspended in an ice-cold lysis buffer containing 50 mM Hepes, pH 7.5, 5% glycerol, 15 mM dithiothreitol (DTT), 100 mM KCl, 5 mM EDTA, and a protease inhibitor cocktail (Roche, Mannheim, Germany) and disrupted by vortexing in the presence of acid-washed glass beads. Proteins were extracted and digested as previously described [25]. Peptide mixtures were cleaned by Sep-Pak tC18 cartridges (Waters, Milford, MA, USA) and eluted with 60% acetonitrile. Off-gel electrofocusing (OGE) was performed using a pH 3–10 IPG strip (Amersham Biosciences, Otelfingen, Switzerland) and a 3100 OFFGEL Fractionator (Agilent Technologies) with collection into 24 wells. The composition of the separation medium and the OGE operating conditions were as in [23]. Peptides collected in each well were cleaned again by Sep-Pak tC18 cartridges. All peptide samples were evaporated to dryness in a vacuum centrifuge and resolubilized in 0.1% formic acid.

Samples were separated using an Eksigent nano LC system (Eksigent Technologies, Dublin, CA, USA) connected to a 15 cm fused silica emitter, 75 μm inner diameter (BGB Analytik, Böckten, Switzerland), packed in-house with a Magic C18 AQ 3 μm resin (Michrom BioResources, Auburn, CA, USA). 1 μg of phosphopeptides was analyzed per LC-MS/MS run using a linear gradient from 98% solvent A (0.15% formic acid) and 2% solvent B (98% acetonitrile, 2% water, 0.15% formic acid) to 30% solvent B over 90 minutes at a flow rate of 300 nl/min. Mass spectrometric analysis was carried out on a high performance LTQ-FT mass spectrometer equipped with a nanoelectrospray ion source (both from Thermo Electron, Bremen, Germany) as recently described [26]. In brief, each MS1 scan (acquired in the ICR cell) was followed by collision induced dissociation (CID, acquired in the LTQ part) of the three most abundant precursor ions, with dynamic exclusion for 20 seconds. Singly charged ions and ions with unassigned charge state were excluded from triggering MS2 events. The normalized collision energy was set to 32%, and one microscan was acquired for each spectrum. Peak lists were generated with Xcalibur version 2.2 SP1.

Both of these datasets are available through the Peptide Atlas data repository [27].

4. Data analysis

For processing and evaluation of the raw MS data, all files were run through the SPIRE analysis system using default parameters (http://www.proteinspire.org). SPIRE was run using the X!Tandem search engine (X!Tandem Tornado 2008.12.01.1) [5] against the yeast target-decoy database. The target was the S. cerevisiae database at the National Center for Biotechnology Information of the National Institutes of Health (http://ncbi.nlm.nih.gov/, accessed May 2010, containing 6873 proteins) together with a database of known contaminants. The decoy was the set of all randomized (reshuffled) yeast and contaminant proteins (for details see [16]). X!Tandem was run with the following parameters: fully tryptic cleavage, a static modification of 57.02 Da on cysteine, a variable modification of 16.0 Da on methionine, a mass tolerance of ±2.5 Da, up to 2 missed cleavages allowed, and a maximum valid expectation value of 0.1 (X!Tandem default). The LIPS probability score output of SPIRE [21] and the protein sequence file were used as input for both the MAYU and SPIRE protein scoring methods. Parameters for MAYU were the defaults, except that the decoy ratio was set to be the same as in SPIRE and protein size binning was turned off in order to make the approach more comparable to, and easier to integrate with, SPIRE. Figures were created using the R statistical package (http://r-project.org).
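As an illustration of the reshuffled decoy database mentioned above, the sketch below produces one shuffled decoy sequence per target sequence, so that the decoy database has the same number of entries, the same sequence lengths, and the same amino acid composition as the target. The function name, the `DECOY_` prefix, and the toy sequences are hypothetical; the procedure actually used for this work is the one described in [16].

```python
import random


def make_shuffled_decoys(targets, seed=0, prefix="DECOY_"):
    """Return a dict of reshuffled decoy sequences, one per target.

    targets: dict mapping protein name -> amino acid sequence.
    seed: fixed seed so that the decoy database is reproducible.
    """
    rng = random.Random(seed)
    decoys = {}
    for name, sequence in targets.items():
        residues = list(sequence)
        rng.shuffle(residues)  # random reordering of the residues
        decoys[prefix + name] = "".join(residues)
    return decoys


if __name__ == "__main__":
    # Toy sequences, not real yeast proteins.
    toy_targets = {"protA": "MKTAYIAKQRQISFVK", "protB": "MSDNGELEDKPPAPPVR"}
    for name, seq in make_shuffled_decoys(toy_targets).items():
        print(name, seq)
```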

5. Results

The numbers of protein IDs for the MAYU and SPIRE approaches as a function of the FDR are shown in Table 1 and Figs. 1, 2, and 3, demonstrating the similar performance and very high concurrence of these two methods. The Aebersold study was significantly more extensive than the Gygi study, resulting in a

Table 1 – Comparison of the number of protein IDs at 0.5%, 1.0%, 2.0%, and 5.0% FDR. Data are from the Aebersold and Gygi studies using the MAYU, SPIRE and IPM methods.

Dataset     Approach   FDR 0.5%   FDR 1.0%   FDR 2.0%   FDR 5.0%
Aebersold   MAYU       2764       2828       2949       3127
Aebersold   SPIRE      2727       2815       2884       3038
Aebersold   IPM        2796       2875       2986       3103
Gygi        MAYU       1417       1432       1476       1589
Gygi        SPIRE      1479       1494       1542       1585
Gygi        IPM        1483       1514       1565       1592

higher coverage of the yeast proteome (see Table 1). To illustrate the performance of MAYU and SPIRE, we focus on method comparisons using an FDR of 1%. A 1% FDR is becoming a commonly accepted threshold for asserting the identification of a protein in a sample, especially for large or meta-analysis studies (see e.g. [28]). If a higher threshold is used, such as 5%, then the additional protein IDs often have a very high FDR [22,28]. At 1% FDR, MAYU results in 2828 vs. 1432 protein IDs for the Aebersold and Gygi studies, respectively, while SPIRE without the FP protein identification adjustment identified 2815 vs. 1494. Because of the higher coverage in the Aebersold dataset, the MAYU approach's adjustment for the number of FP PSMs on TP proteins should have a stronger effect there. This effect is clear in Fig. 1, where the MAYU approach outperforms SPIRE on the Aebersold dataset, while SPIRE excels on the Gygi dataset (Fig. 2). The IPM approach capitalizes on the advantages of both methods, the FDR adjustment of MAYU and the improved discrimination of SPIRE, as can be seen in Table 1 and Figs. 1 and 2. All three methods (MAYU, SPIRE and IPM) display remarkable agreement on the majority of proteins in the Aebersold and Gygi studies (Fig. 3). At a 1% FDR the three approaches agree on the identities of 2793 out of 2893 IDs (total number of all identified proteins) in the Aebersold study with

Fig. 1 – Comparison of three FDR approximation approaches for the Aebersold study. The performance of the MAYU, SPIRE and IPM approaches to FDR approximation were compared on the Aebersold dataset. Performance was measured in terms of the number of proteins identified at a given protein FDR estimate. The inset is cropped to focus on the differences between the three approaches at low FDR.

Fig. 2 – Comparison of three FDR approximation approaches for the Gygi study. The performance of the three approaches to FDR approximation was compared on the Gygi dataset. The inset is cropped to focus on the differences between the three approaches at low FDR.

a concurrence of 96.5%. Similarly, for the Gygi study, the agreement was 1399 out of 1533 IDs, a concurrence of 91.3%. It is important to note that intersections of this magnitude are very rare in high-throughput analyses. It is also worth noting that the validity of MAYU's FDR estimation was previously shown by its agreement with an independent isoelectric point method detailed in [19]. Even though the agreement between the MAYU and SPIRE approaches is striking, each has distinct advantages in certain circumstances. Specifically, MAYU's FDR estimate is less conservative, especially in datasets with high proteome coverage, and SPIRE shows increased discrimination of protein IDs by considering additional protein-specific predictors. Integration of these two approaches into IPM capitalizes upon the benefits of each to yield an even more robust approach. With both datasets, the IPM approach results in an improvement over either individual method. At a 1% FDR on the high coverage Aebersold dataset, the IPM approach resulted in an additional 47 protein IDs over MAYU and 60 protein IDs over SPIRE (see Fig. 3A and Table 1, Supplementary Materials). At a 1% FDR on the Gygi dataset, the IPM approach identified 82 more proteins than MAYU and 20 more proteins than SPIRE (see Fig. 3B and Table 1, Supplementary Materials). Although the improvement in the number of protein IDs by IPM may seem modest, these additional IDs most often correspond to difficult-to-identify and low abundance proteins. This is a very important point, which is clearly demonstrated by analysis of concentrations of yeast proteins as measured by tandem affinity purification (TAP) by the number of copies of mRNA [29]. Using side-by-side boxplots, we compared the concentrations of the 47 proteins identified at 1% FDR by IPM and not MAYU (see Fig. 4) with those of the core of proteins identified by all three methods.
The distribution of proteins identified by all three methods contains more highly expressed proteins than the distribution of those identified by IPM and not MAYU. Specifically, the proteins identified by all three methods had an average concentration of 14,000 copies per cell in the Aebersold study and 24,000

Fig. 3 – Venn diagram of the overlap in protein IDs between three methods at 1% FDR. (A) A comparison of protein IDs for the MAYU, SPIRE and IPM (labeled "Integrated" in the diagram) approaches on the Aebersold dataset. (B) The comparison of three approaches on the Gygi dataset.

copies per cell in the Gygi study vs. 2,100 and 3,000 copies per cell, respectively, for the proteins identified by IPM and not MAYU. These low abundance proteins represent a significant portion of candidates for biomarkers and drug discovery targets as well as essential regulators (see e.g. [30]). This IPM approach has been implemented as a part of the SPIRE analysis environment.
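The gains and concurrence figures quoted in this section follow directly from the 1% FDR column of Table 1 and from the agreement counts given in the text. The short check below recomputes them; the variable and function names are hypothetical, while the numbers are those reported above.

```python
# Protein IDs at 1% FDR, copied from Table 1.
ids_at_1pct = {
    "Aebersold": {"MAYU": 2828, "SPIRE": 2815, "IPM": 2875},
    "Gygi": {"MAYU": 1432, "SPIRE": 1494, "IPM": 1514},
}

# (IDs shared by all three methods, total IDs), as quoted in the text.
agreement = {"Aebersold": (2793, 2893), "Gygi": (1399, 1533)}


def ipm_gain(dataset):
    """Additional protein IDs found by IPM over MAYU and over SPIRE."""
    row = ids_at_1pct[dataset]
    return {method: row["IPM"] - row[method] for method in ("MAYU", "SPIRE")}


def concurrence_pct(dataset):
    """Percentage of all IDs on which the three methods agree."""
    core, total = agreement[dataset]
    return round(100 * core / total, 1)


if __name__ == "__main__":
    for name in ids_at_1pct:
        print(name, ipm_gain(name), concurrence_pct(name))
```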

Fig. 4 – Boxplots of log TAP protein concentrations of identified proteins between three methods at 1% FDR. (A) A comparison of concentration distributions for the MAYU, SPIRE and IPM approaches on the Aebersold dataset. (B) The comparison of three approaches on the Gygi dataset. The TAP concentrations are generally much lower for identifications outside the intersection of all three methods. The vertical axis of each panel is the log of TAP expression; the group labels in the panels include ALL, MAYU, IPM, MAYU+IPM and SPIRE+IPM.

6. Discussion

There are intrinsic benefits to both MAYU and SPIRE that make the independent performance of each approach superior in differing scenarios. First, the results of MAYU and SPIRE are remarkably consistent when identifying core, reasonably abundant proteins. Second, the combined IPM approach presented here was able to take advantage of the benefits of each individual approach and is expected to consistently outperform either approach in isolation. Applying IPM will result in a higher number of protein identifications at a given FDR.

In this work the results of the X!Tandem searches were processed and evaluated by SPIRE using the LIPS model for assigning PSMs and were subsequently analyzed by MAYU. However, all of the protein ID approaches implemented in this work are agnostic to the particular search engine or PSM scoring approach and thus can be used with any search engine or scoring method. Additionally, MAYU has a protein size binning option that could further improve FDR estimation [17], and this option can be adapted to the IPM approach in the future.

Implementation of IPM will allow high-throughput discovery proteomics studies to achieve a higher degree of confidence in mass spectrometry-based protein IDs. IPM will also offer increased sensitivity in identifying low abundance proteins, particularly in instances of high proteome coverage. This combined IPM approach has been implemented as part of SPIRE (www.proteinspire.org) [11], and IPM is an example of the types of integrated approaches that are the focus of the SPIRE platform.

Supplementary materials related to this article can be found online at doi:10.1016/j.rtbm.2011.06.006.

Acknowledgements

We would like to thank Manfred Claassen, Evelyne Kolker, Arnold Smith, and Charles Smith for their critical reading and insightful discussions. The support from NIH (under NIGMS grant 5R01 GM076680-02 and NIDDK grant UO1 DK072473), NSF (under DBI grant 0544757 and ABI grant 07140), and SCRI Internal Funds to E.K., and from the Swiss National Science Foundation (SNF, under grant 31000–10767), the European Research Council (under grant ERC-20089-AdG 233226) and SystemsX.ch to R.A. is greatly appreciated.

REFERENCES

[1] Griffin TJ, Goodlett DR, Aebersold R. Advances in proteome analysis by mass spectrometry. Curr Opin Biotechnol 2001;12:607–12.
[2] Aebersold R, Mann M. Mass spectrometry-based proteomics. Nature 2003;422:198–207.
[3] Eng JK, McCormack AL, Yates JR III. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom 1994;5:976–89.
[4] Perkins DN, Pappin DJ, Creasy DM, Cottrell JS. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999;20:3551–67.
[5] Fenyö D, Beavis RC. A method for assessing the statistical significance of mass spectrometry-based protein identifications using general scoring schemes. Anal Chem 2003;75:768–74.
[6] Geer LY, Markey SP, Kowalak JA, Wagner L, Xu M, Maynard DM, et al. Open mass spectrometry search algorithm. J Proteome Res 2004;3:958–64.
[7] Keller A, Nesvizhskii AI, Kolker E, Aebersold R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal Chem 2002;74:5383–92.
[8] Nesvizhskii AI, Keller A, Kolker E, Aebersold R. A statistical model for identifying proteins by tandem mass spectrometry. Anal Chem 2003;75:4646–58.
[9] Kolker E, Higdon R, Hogan JM. Protein identification and expression analysis using mass spectrometry. Trends Microbiol 2006;14:229–35.
[10] Alexandre, F. Corraless, J. Cox, et al. Facing challenges in Proteomics today and the coming decade: Report of Roundtable Discussions at the EuPA Scientific Meeting - Estoril 2010. J Prot 2011 [Epub ahead of print].
[11] Higdon R, Kolker N, Stewart E, Welch D, Bauman A, Broomail B, Haynes W, Kolker E. SPIRE: Systematic Protein Investigative Research Environment. Journal of Proteomics 2011 [Epub ahead of print].
[12] Elias JE, Haas W, Faherty BK, Gygi SP. Comparative evaluation of mass spectrometry platforms used in large-scale proteomics investigations. Nat Methods 2005;2:667–75.
[13] Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple hypothesis testing. J R Stat Soc 1995;57:289–300.
[14] Moore RE, Young MK, Lee TD. Qscore: an algorithm for evaluating SEQUEST database search results. J Am Soc Mass Spectrom 2002;13:378–86.
[15] Peng J, Elias JE, Thoreen CC, Licklider LJ, Gygi SP. Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome. J Proteome Res 2003;2:43–50.
[16] Higdon R, Hogan JM, Van Belle G, Kolker E. Randomized sequence databases for tandem mass spectrometry peptide and protein identification. OMICS 2005;9:364–79.
[17] Elias JE, Gygi SP. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat Methods 2007;4:207–14.
[18] Choi H, Ghosh D, Nesvizhskii AI. Statistical validation of peptide identifications in large-scale proteomics using the target-decoy database search strategy and flexible mixture modeling. J Proteome Res 2007;7:286–92.
[19] Reiter L, Claassen M, Schrimpf SP, Jovanovic M, Schmidt A, Buhmann JM, et al. Protein identification false discovery rates for very large proteomics data sets generated by tandem mass spectrometry. MCP 2009;8:2405–17.
[20] Higdon R, Kolker E. A predictive model for identifying proteins by a single peptide match. Bioinformatics 2007;23:277–80.
[21] Higdon R, Kolker N, Picone A, van Belle G, Kolker E. LIP index for peptide classification using MS/MS and SEQUEST search via logistic regression. OMICS 2004;8:357–69.
[22] Hather G, Higdon R, Bauman A, von Haller PD, Kolker E. Estimating false discovery rates for peptide and protein identification using randomized databases. Proteomics 2010;10:2369–73.
[23] Schwarz GE. Estimating the dimension of a model. Ann Stat 1978;6:461–4.
[24] Cleveland WS, Devlin SJ. Locally weighted regression: an approach to regression analysis by local fitting. J Am Stat Assoc 1988;83:596–610.
[25] Picotti P, Bodenmiller B, Mueller LN, Domon B, Aebersold R. Full dynamic range proteome analysis of S. cerevisiae by targeted proteomics. Cell 2009;138:795–806.
[26] Schmidt A, Gehlenborg N, Bodenmiller B, Mueller LN, Campbell D, Mueller M, et al. An integrated, directed mass spectrometric approach for in-depth characterization of complex peptide mixtures. MCP 2008;7:2138–50.
[27] King NL, Deutsch EW, Ranish JA, Nesvizhskii AI, Eddes JS, Mallick P, et al. Analysis of the S. cerevisiae proteome with PeptideAtlas. Genome Biol 2006;7:11–8.
[28] Higdon R, Haynes W, Kolker E. Meta-analysis for protein identification: a case study on yeast data. OMICS 2010;14:309–14.
[29] Ghaemmaghami S, Huh WK, Bower K, Howson RW, Belle A, Dephoure N, et al. Global analysis of protein expression in yeast. Nature 2003;425:737–74.
[30] Anderson NL, Anderson NG. The human plasma proteome: history, character, and diagnostic prospects. Proteomics 2002;1:845–67.