Informatics for protein identification by mass spectrometry


Methods 35 (2005) 223–236 www.elsevier.com/locate/ymeth

Richard S. Johnson a,*, Michael T. Davis a, J. Alex Taylor b, Scott D. Patterson a

a Amgen Corporation, Molecular Sciences, 1201 Amgen Court West, Seattle, WA 98119, USA
b Amgen Corporation, Bioinformatics Departments, 1201 Amgen Court West, Seattle, WA 98119, USA

Accepted 25 August 2004
Available online 13 January 2005

Abstract

High throughput protein analysis (i.e., proteomics) first became possible when sensitive peptide mass mapping techniques were developed, thereby allowing for the possibility of identifying and cataloging most 2D gel electrophoresis spots. Shortly thereafter a few groups pioneered the idea of identifying proteins by using peptide tandem mass spectra to search protein sequence databases. Hence, it became possible to identify proteins from very complex mixtures. One drawback to these latter techniques is that it is not entirely straightforward to make matches using tandem mass spectra of peptides that are modified or have sequences that differ slightly from what is present in the sequence database that is being searched. This has been part of the motivation behind automated de novo sequencing programs that attempt to derive a peptide sequence regardless of its presence in a sequence database. The sequence candidates thus generated are then subjected to homology-based database search programs (e.g., BLAST or FASTA). These homology search programs, however, were not developed with mass spectrometry in mind, and it became necessary to make minor modifications such that mass spectrometric ambiguities can be taken into account when comparing query and database sequences. Finally, this review will discuss the important issue of validating protein identifications. All of the search programs will produce a top ranked answer; however, only the credulous are willing to accept them carte blanche.

© 2004 Elsevier Inc. All rights reserved.

Keywords: Peptide mass fingerprinting; Tandem mass spectrometry; Peptide sequencing; Protein identification; Database search; De novo sequencing; Bioinformatics; Proteomics; Homology search

1. Introduction

There was a time when protein mass spectrometrists could leisurely peruse their data, carefully derive sequences manually, and then move on to acquire additional data to ponder. Those days are gone. Although it is wise to occasionally look at data, it is generally recognized that for many proteomics applications manual interpretation is no longer practical and that the high rate of data acquisition requires an extensive and sophisticated level of computational power and software to automatically make protein identifications. This review focuses on aspects of automated protein identification using mass spectral data; specifically, peptide mass mapping, database searches using tandem mass spectrometry (MS/MS),1 de novo sequencing followed by homology searches, and validation of the protein identifications that are made using these computer programs.

* Corresponding author. Fax: +1 206 217 5529. E-mail address: [email protected] (R.S. Johnson).

1046-2023/$ - see front matter © 2004 Elsevier Inc. All rights reserved. doi:10.1016/j.ymeth.2004.08.014

1 Abbreviations used: FAB, fast atom bombardment; MALDI, matrix-assisted laser desorption ionization; ESI, electrospray ionization; IT, ion-trap instrument; TOF, time-of-flight instrument; TQ, triple quadrupole instrument; QTOF, quadrupole/time-of-flight hybrid instrument; FTMS, Fourier transform mass spectrometry instrument; MS, single stage mass spectrometry; MS/MS, tandem mass spectrometry; LC-MS/MS, liquid chromatography coupled on-line to an ESI-MS/MS.


R.S. Johnson et al. / Methods 35 (2005) 223–236

2. Peptide mass fingerprinting

The first widely used application of mass spectrometry to protein analytical sciences was MALDI mass mapping. This was due to the commercial release of MALDI-MS instruments and the development of algorithms that provided a means of matching the experimentally observed masses to those calculated from the nascent sequence databases of the early 1990s. Mass spectrometry had been used earlier to map peptides derived from proteolytic digests using FAB-MS [1], where it was applied to the confirmation of translated DNA sequences. Thus, the concept of mass mapping had already been established by 1984. However, it took at least three additional developments (more sensitive ionization methods, increased computer power, and more extensive sequence databases) before five research groups published papers in 1993 describing how mass mapping could be used to identify gel-separated proteins [2–6] (Fig. 1). These groups showed that peptide masses could be calculated for each entry in a protein sequence database (or translation of a nucleotide sequence database) using the expected cleavage specificity of the protease employed in the experiment. Experimental masses are then determined by MALDI-MS (or ESI-MS) following proteolytic digestion of isolated proteins (often gel-separated). The observed masses can then be matched against the set of calculated peptide masses, and a score is generated based upon how many of the observed masses matched individual sequence entries. The method was adopted quickly, as it provided a rapid solution to the problem of identifying gel-separated proteins at relatively low levels, something that had hampered the 2D electrophoresis field since its inception. However, as the method became more widely adopted for the identification of gel-separated proteins, the limitations of the approach became more apparent, as did solutions to some of these problems (i.e., MS/MS database matching, see below).

Fig. 1. Schematic of the peptide mass fingerprinting process.

Peptide mass fingerprinting is an appropriate method for protein identification under certain conditions. First, this method should only be used when there is a good chance that the sample to be analyzed contains a purified protein (typically, a 2D gel spot). Slightly more complicated mixtures containing two or three proteins might also be reliably analyzed, but the method is really not suitable for anything more complex. Second, the proteins to be identified should be from a species that is well represented in sequence databases, since cross-species identifications are only going to be successful for those proteins that are approaching 100% sequence identity. If these conditions are met, then the method can be an efficient way to proceed. A number of search parameters need to be considered. For example, enzymes (including trypsin) do not cleave all substrates to completion, and one parameter that can be manipulated is the number of missed cleavage sites allowed per peptide. Typically, no more than one or two missed cleavages should be permitted. Amino acid modifications that are essentially stoichiometric need to be included as a search parameter (sometimes called a “fixed” modification); for example, cysteine residues are often alkylated to assist in proteolytic digestion. The “variable” modifications are non-stoichiometric (e.g., phosphorylation). Variable modifications should be included in a search sparingly, and only for rare amino acids; for example, inclusion of oxidation of methionine (+16 Da) might be suitable, whereas variable phosphorylation of serine, threonine, and tyrosine is not. Problems arise when more frequently occurring amino acids are said to be variably modified, as many peptides will be affected, slowing the search and increasing the number of potential false positives that have the same mass. One should also restrict the database to the species of interest to keep the universe of potential matches small.
Some search programs allow the use of intact protein molecular weight as a parameter that limits the search to fewer proteins. This seems risky and is not recommended, because in vivo and ex vivo proteolysis can produce proteins whose molecular weight is lower than what would be calculated from the gene sequence. The parameter with the largest effect is the peptide mass tolerance (i.e., how accurate are the masses?). Tight tolerances, provided by instruments capable of high mass accuracy and resolution, will greatly reduce the number of false positives [7]. Of course, one also needs to specify whether the peptide masses are monoisotopic or average [8]. At this point in time, all peptide mass fingerprinting should be carried out on mass spectrometers capable of resolving 13C isotope peaks for peptides below 3500 Da; hence, the mass to charge ratio of the 12C isotope peaks should be used in a search with the monoisotopic option selected. Most of the mass mapping search engines provide a statistical metric to aid in validating the protein identification, which is usually an estimate of the probability that the match is random (based on the number of matching peptides, number of unmatched ions in the spectrum, the sequence coverage, mass accuracy, etc.).
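The matching step described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the algorithm of any particular search engine: the function names (`tryptic_peptides`, `mh_plus`, `count_matches`) are hypothetical, trypsin is assumed to cleave C-terminal to K or R except before P, cysteine is left unmodified, and the "score" is simply the number of observed masses that fall within the tolerance of any calculated peptide mass.

```python
import re

# Monoisotopic residue masses in Da (cysteine unmodified in this sketch).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def tryptic_peptides(protein, max_missed=1):
    """Cleave C-terminal to K/R (assumed: not before P), allowing missed cleavages."""
    pieces = re.findall(r".*?[KR](?!P)|.+$", protein)
    peptides = set()
    for i in range(len(pieces)):
        last = min(i + max_missed, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.add("".join(pieces[i:j + 1]))
    return peptides


def mh_plus(peptide):
    """Monoisotopic mass of the singly protonated peptide, (M + H)+."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


def count_matches(observed_mh, protein, tol_da=0.1, max_missed=1):
    """Number of observed masses that match any calculated peptide mass."""
    calculated = [mh_plus(p) for p in tryptic_peptides(protein, max_missed)]
    return sum(
        any(abs(obs - calc) <= tol_da for calc in calculated)
        for obs in observed_mh
    )
```

Running `count_matches` over every database entry and ranking by the count reproduces the basic idea; real programs replace the raw count with a probability-based score, and widening `tol_da` or `max_missed` shows directly how the number of chance matches grows.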

3. MS/MS database matching

For a few years following the introduction of commercially available MALDI and ESI mass spectrometers, MS/MS spectra were generated by inspection of the MS ion signals and manually setting the instrument to fragment each precursor ion. Given this low rate of data acquisition, manual inspection of the resulting spectra was feasible, and Mann and Wilm [9] introduced a method that was a hybrid of manual de novo interpretation and database searching that employed what they termed a “peptide sequence tag”. In this approach, a partial sequence is determined from a series of fragment ions whose mass differences match amino acid residue masses. For instruments containing quadrupole collision cells (TQ or QTOF), one can often “read off” a short sequence from a limited series of y-ions at mass to charge ratios greater than that of the tryptic peptide precursor. Combined with the unsequenced masses at the N- and C-terminal side of the sequence tag, plus the proteolytic enzyme specificity, a sequence database search can be performed. Over time, however, instrument control software capable of automatically selecting and fragmenting peptides was developed (so-called “data-dependent” analysis) [10]. This provided the ability to collect thousands of MS/MS spectra every hour, thereby swamping interpretation methods that relied on any sort of manual interpretation. Eventually automated peptide sequence tag programs were developed [11], but in the meantime alternative software solutions were developed for searching sequence databases using uninterpreted MS/MS spectra. There are a number of programs that match uninterpreted MS/MS spectra to sequence databases [7,12,13], including some vendor-supplied software, but the first of these was SEQUEST [14]. The NIH is developing its own open source program [15], and the Manitoba Centre for Proteomics hosts a website that distributes an open source search program (http://www.proteome.ca/opensource.html) [16].
All of these programs operate with a similar premise, which is that the mass of the peptide is known, the masses of the fragment ions are known (although their ion type is unknown), and the putative specificity of the enzyme used to generate the peptides is known. These three pieces of information can be used to search protein sequence databases or translations of nucleotide sequence databases. The general procedure (Fig. 2) involves a peptide mass pre-filter, where all peptides in a sequence database that match the peptide molecular weight (within a specified tolerance) are identified. A further optional constraint is that these database-derived sequences may be required to match the consensus cleavage of the protease that was used to produce them. Given such a list of sequences, mock tandem mass spectra are derived using known peptide fragmentation principles (for a review, see [17] as well as the Wysocki review contained within this issue), which are then compared to the experimental spectrum. Scores are generated that reflect how well each mock spectrum matches the real one, and the sequences are ranked accordingly. The biggest difference between software packages lies in how these scores are derived. Unlike peptide mass fingerprinting, very complex samples can be analyzed, since each peptide is independently matched to a database sequence. Hence, the method is appropriate for the analysis of 2D and 1D gel samples, as well as “no gel” samples (e.g., cell lysates). However, MS/MS database matching still works best for peptides that exactly match the correct sequence in a database. Identification of homologous proteins is difficult using MS/MS database matching unless there is a substantial percent identity, such that a few of the peptides are exact matches. The parameters described for peptide mass fingerprinting also apply to MS/MS searches, with one additional parameter: fragment ion mass tolerance. Due to the higher information content of most MS/MS spectra, it has been suggested that the enzyme constraints be removed. However, this is not ideal, as the number of peptides that match the observed mass will become very large, which increases the search times and, more importantly, increases the number of false positives. In reality, there are few peptides in a proteolytic digest that do not have at least one correct terminus; therefore, if a program provides the ability to require that only one terminus conform to the enzyme cleavage specificity, then use of that option might allow for the identification of a few more peptides.

Fig. 2. Schematic of the uninterpreted MS/MS spectrum search identification process.
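The general procedure (peptide mass pre-filter, mock spectrum, score, rank) can be illustrated with a short Python sketch. It is deliberately simplistic and makes several assumptions that are not taken from any of the cited programs: the mock spectrum contains only singly charged b- and y-ions, the score is a plain count of mock fragments found in the observed peak list, and the function names (`neutral_mass`, `mock_spectrum`, `search`) are hypothetical.

```python
# Monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def neutral_mass(peptide):
    """Monoisotopic molecular weight of the unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def mock_spectrum(peptide):
    """Singly charged b- and y-ion m/z values, one pair per peptide bond."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    total = sum(masses)
    ions = []
    prefix = 0.0
    for m in masses[:-1]:
        prefix += m
        ions.append(prefix + PROTON)                  # b-ion
        ions.append(total - prefix + WATER + PROTON)  # complementary y-ion
    return ions


def search(precursor_mass, observed_mz, peptides, prec_tol=0.5, frag_tol=0.5):
    """Pre-filter candidates by peptide mass, then rank by shared peak count."""
    hits = []
    for pep in peptides:
        if abs(neutral_mass(pep) - precursor_mass) > prec_tol:
            continue  # fails the peptide mass pre-filter
        score = sum(
            any(abs(frag - obs) <= frag_tol for obs in observed_mz)
            for frag in mock_spectrum(pep)
        )
        hits.append((score, pep))
    return sorted(hits, reverse=True)
```

Note that two candidates with the same composition but a different residue order both pass the pre-filter and are separated only by the fragment score, which is why the scoring function is where the software packages differ most.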


As with any of these variables, it is preferable to start with the most conservative search parameters (e.g., completely tryptic termini, zero or one uncleaved trypsin site per peptide, no variable modifications, and tight precursor and fragment tolerances) and then to re-search using more relaxed conditions to see whether any additional plausible matches can be made. For various reasons, the data-dependent MS/MS acquisition process can in some cases lead to larger than expected errors in the precursor ion measurement (see below). Furthermore, for multiply charged precursors produced by ESI, the error in the measured mass to charge ratio must be propagated to the peptide molecular weight, where it is multiplied by the charge. Hence, the precursor tolerance should be somewhat larger than what would be used in a peptide mass fingerprint search.
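The propagation of the precursor error for multiply charged ions is simple arithmetic: the neutral mass is M = z(m/z − m_proton), so an uncertainty of ±δ in m/z becomes ±zδ in M. A minimal sketch (function names are hypothetical):

```python
PROTON = 1.00728  # mass of a proton in Da


def neutral_mass(mz, charge):
    """Peptide molecular weight from the m/z of an (M + zH)z+ ion."""
    return charge * (mz - PROTON)


def mass_window(mz, charge, mz_tol):
    """Neutral-mass window implied by an m/z uncertainty of +/- mz_tol."""
    center = neutral_mass(mz, charge)
    half_width = charge * mz_tol  # the m/z error is multiplied by the charge
    return center - half_width, center + half_width
```

For example, a triply charged precursor measured to within ±0.2 in m/z needs a ±0.6 Da window on the peptide molecular weight.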

4. Databases

There are two approaches towards choosing a database. A large, comprehensive, multi-species database will ensure that as many candidates as possible are considered. On the other hand, a smaller database that is species-specific will produce fewer false positives. For a comprehensive multi-species non-redundant protein database, the two most common choices are the MSDB database from the European Bioinformatics Institute (EBI), and the “nr” database from the National Center for Biotechnology Information (NCBI) (Table 1). The “nr” database is more comprehensive but lacks the extensive annotation provided with MSDB. EMBL-EBI provides some nicely curated, non-redundant human, mouse, and rat species-specific databases (IPI) that combine protein records from Swiss-Prot, TrEMBL, RefSeq, and Ensembl [18]. For screening against patented sequences, the NCBI provides “pataa,” which is a database of protein sequences extracted from patents. Faster and more complete access to patented sequences can be obtained commercially from Thomson Derwent’s GENESEQ database. Many search tools allow for the use of nucleotide databases as well as protein databases. While the searching of nucleotide databases tends to be less common due to the longer search times needed for searching multiple frame translations and the higher false positive rate, it can sometimes be useful to search nucleotide databases when searches against protein databases are not fruitful. The enormous size and redundancy of the EST database makes UniGene, which is built by clustering ESTs and mRNA sequences, a more attractive alternative. Whole genome sequences, when available, can also be used for searching [19].

Table 1
Sequence databases

Database    Characteristics      URL
MSDB        Multi-species        ftp://ftp.ebi.ac.uk/pub/databases/MassSpecDB/
nr          Multi-species        ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz
ipi         Human, rat, mouse    ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/
pataa       Patent-derived       ftp://ftp.ncbi.nih.gov/blast/db/FASTA/pataa.gz
GENESEQ     Patent-derived       http://thomsonderwent.com/products/patentresearch/geneseq/
UniGene     EST clusters         ftp://ftp.ncbi.nih.gov/repository/UniGene/
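To make the cost of searching nucleotide databases concrete, the sketch below translates a DNA sequence in all six reading frames using the standard genetic code. It is an illustration only: the function names are hypothetical, unknown codons are rendered as "X", and real search tools handle stop codons, ambiguity codes, and frame boundaries in their own ways.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; '*' marks a stop, 'X' an unknown codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Three forward-strand frames followed by three reverse-strand frames."""
    dna = dna.upper()
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames
```

Every nucleotide entry therefore contributes six candidate protein sequences, at most one of which is normally the real reading frame, which is consistent with the longer search times and higher false positive rate noted above.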

5. Pre-processing of MS/MS spectra

To perform an MS/MS database search or de novo sequence determination, the tandem mass spectral data have to be extracted from the raw binary (and proprietary) LC-MS/MS files. The mass spectrometry vendors’ preference would be that users buy into their complete proteomics “solutions,” in which case there is no need to extract MS/MS spectra as individual text files. Those who wish to use third party software have to worry about this issue. Fortunately, most vendors also supply rudimentary methods for extracting data from their binary files. One commonly used format is the so-called “dta” file format that originated with the SEQUEST program. It is simply a text file containing pairs of mass to charge ratio and intensity (separated by a space or tab), preceded by a single-line header that holds the protonated mass of the peptide and the precursor ion’s charge (also separated by a space). Why the header contains the protonated peptide mass is a bit of a mystery, but it probably has historical roots, in that FAB ionization primarily produces (M + H)+ peptide ions. The dta files may contain any number of m/z and intensity pairs, and thus may represent either a raw spectrum or a derived list of peaks. Reducing a raw spectrum to a list of significant peaks can speed and guide the search process if done correctly. Some search algorithms attempt such spectrum reduction as an initial step while others simply work with whatever they are given. There are now also separate pre-processing components available (e.g., “Distiller” from Matrix Science, London, UK) that can be used prior to submitting data to a search. There is some interest in developing computer methods to winnow out bad spectra prior to any further computer processing [20]. The impetus, of course, is to reduce the overall computer processing time.
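Because the dta layout is so simple, a reader for it takes only a few lines. The sketch below follows the description given above (one header line with the (M + H)+ mass and the charge, then m/z and intensity pairs); the function name and the returned dictionary keys are hypothetical, and no vendor-specific variations are handled.

```python
PROTON = 1.00728  # mass of a proton in Da


def parse_dta(text):
    """Parse the contents of a dta file into precursor info and a peak list."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    mh = float(rows[0][0])    # protonated peptide mass, (M + H)+
    charge = int(rows[0][1])  # precursor charge state
    peaks = [(float(mz), float(intensity)) for mz, intensity in rows[1:]]
    # Convert the singly protonated mass back to the observed precursor m/z.
    precursor_mz = (mh + (charge - 1) * PROTON) / charge
    return {"mh": mh, "charge": charge, "precursor_mz": precursor_mz,
            "peaks": peaks}
```

Splitting on any whitespace accepts both the space-separated and tab-separated variants mentioned in the text.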
However, given that a large fraction of spectra within an LC-MS/MS data set is of reasonable quality (say, 10–20%), a perfect algorithm for winnowing “bad” spectra might allow a 5-fold increase in processing speed. For some, it might be more cost effective to simply buy five times as many computers, which would also eliminate the risk of removing a “good” spectrum by accident. This is not to say that it is not worthwhile to remove obviously bad spectra: those containing only a few ions, files where all the fragment ions are within very narrow mass windows, or files for unusually low mass (i.e., short) peptides.

One other aspect of spectrum pre-processing prior to a database search is the identification of similar spectra within an LC-MS/MS data set or within a larger MS/MS spectral database. A cross-correlation method has been used to perform library searches of peptide MS/MS spectra [21], but dot product comparisons are not as computationally intensive [22–24]. Regardless of the method used to identify similar spectra, a critical question is what to do with this information. Originally, it was proposed that new data could be compared to a library of peptide MS/MS spectra in order to quickly identify common proteins that are seen repeatedly (e.g., trypsin autolysis or keratin peptides) [21]. Depending on how the LC-MS/MS is operated, it is possible to repeatedly acquire MS/MS spectra of the same peptide within the same experiment. These groups of similar spectra can either be searched individually but validated as a group [23], or combined into a single composite spectrum prior to a search [24]. The former would seem to be the safer procedure; however, the latter is claimed to result in improved database search scores due to the improved signal to noise ratios that result from averaging several MS/MS scans.
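A dot product comparison of two MS/MS spectra can be sketched as follows. The details are assumptions chosen for illustration and are not taken from the cited methods: peaks are binned at a fixed width, intensities are square-root scaled to damp dominant peaks, and the result is normalized so that identical spectra score 1 and spectra with no bins in common score 0.

```python
import math
from collections import defaultdict


def binned(peaks, bin_width=1.0):
    """Sum square-root-scaled intensities into fixed-width m/z bins."""
    vector = defaultdict(float)
    for mz, intensity in peaks:
        vector[int(mz / bin_width)] += math.sqrt(intensity)
    return vector


def dot_product_similarity(peaks_a, peaks_b, bin_width=1.0):
    """Normalized dot product (cosine similarity) of two binned spectra."""
    a = binned(peaks_a, bin_width)
    b = binned(peaks_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

Spectra whose similarity exceeds a chosen threshold (and whose precursor masses agree) could then be grouped, either for validation as a group or for merging into a composite spectrum.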

6. Why do not all MS/MS spectra match to a sequence entry?

Novices should not be alarmed by the fact that a large fraction (usually a majority) of the acquired spectra are not matched to a sequence entry. Of course, when working with samples derived from organisms not well represented in any sequence database, few matches can be expected, since exact matches are generally required (Fig. 2). Even a conservative substitution of valine for isoleucine, resulting in a 14 Da mass error, is sufficient to exclude a very homologous sequence from further consideration. However, as more genomes get sequenced this becomes less of a problem than it was in the past. Nevertheless, even for well-sequenced organisms, sequence polymorphisms can still cause difficulties, since these are sometimes indicated only as annotations, rather than as separate sequence entries. Also, recombinant protein expression is sometimes performed using cell lines derived from organisms whose genomes are not sequenced (e.g., Chinese hamsters), and contaminating proteins derived from the host cell may be difficult to identify using standard MS/MS database search programs. Likewise, contamination of a cell culture (e.g., from an unsequenced mycoplasma species) can result in the presence of unknown proteins.

Peptide modifications can occur in vivo (e.g., phosphorylation) or ex vivo (e.g., methionine oxidation), but in either case the observed peptide mass will be different from what would be calculated from a sequence database. Since the peptide molecular weight is used as the filter to derive candidate sequences, an incorrect molecular weight will only provide incorrect sequences [25]. The Association of Biomolecular Resource Facilities hosts a website tool called Delta Mass (http://www.abrf.org/index.cfm/dm.home) that provides an extensive list of mass changes associated with various modifications. Unimod (www.unimod.org) is a similar listing. However, unless the MS/MS database search program is informed of the specific modification under study, the match will not be made.

A third reason why a spectrum might not match a sequence entry is that the precursor ion was not derived from an intact peptide that conforms to the anticipated cleavage specificity of the enzyme that was used. For example, database searches of tryptic peptides are usually performed with the constraint that all candidate sequences not only match the observed peptide mass, but also are derived from cleavages C-terminal to lysine and arginine. If a peptide was derived from cleavage by another proteolytic activity that was present in the sample before or after protein isolation, then the correct candidate sequence will not be scored and ranked. Many proteins are proteolytically processed (e.g., removal of N-terminal signal peptides); however, database entries usually contain the complete unprocessed sequence. Hence, database searches requiring two tryptic cleavages will usually not find the N-terminal peptide. On occasion, purified enzymes clip at atypical sites. For example, Staphylococcus aureus (strain V8) protease (sometimes called Glu-C because it is supposed to cut on the C-terminal side of Asp and Glu) readily cleaved a Gly-Ile bond within thioredoxin isolated from Chromatium vinosum [26].
A very similar, and more frequent, problem arises due to fragmentation of intact peptide ions in the ion source prior to the first stage of mass analysis. These in-source fragment ions can trigger a data-dependent MS/MS spectrum, but will not be identified when searching for peptides derived solely from, for example, tryptic cleavages. In-source fragmentation can be seen in the MS spectrum of a fetuin peptide (Fig. 3A), where the fragment ions were of sufficient abundance to trigger data-dependent acquisition of their MS/MS spectra regardless of the type of instrument. Most common are in-source fragment ions generated from very labile bonds, such as amide bonds N-terminal to proline. In-source fragmentation resulting in neutral loss of water or ammonia can also occur, as shown for a peptide from apolipoprotein E (Fig. 3B). Either by in-source fragmentation (Fig. 3) or by unexpected proteolytic cleavage, MS/MS spectra can be acquired for precursor ions of unexpected mass. However, as mentioned earlier, it is unlikely that both termini are spuriously clipped, and searches that allow this will be unnecessarily slow, as well as error-prone due to the increase in false positives resulting from the larger number of peptides that get scored.

Fig. 3. In-source fragmentation can cause selection of unanticipated ions for MS/MS. (A) In-source fragmentation can occur irrespective of ion source. A tryptic peptide of human fetuin fragments within the ion source at the three bonds indicated on the sequence. The intact peptide ions are labeled as (M + xH)x+ in the MS spectra, where x is the number of protons. MS spectra of the peptide acquired on IT and QTOF mass spectrometers are shown. (B) Upper MS spectrum is of apolipoprotein E peptide (WELALGR, molecular weight 843.4), which shows an in-source neutral loss of ammonia (m/z 414.2). Data-dependent scanning triggered acquisition of MS/MS spectra of both precursors. The neutral loss occurred from the N-terminal tryptophan. The spectra were acquired on a QTOF mass spectrometer (Micromass, UK).

A fourth common problem is that intact peptide masses are not calculated correctly. If the charge state of the precursor ion is not determined correctly or if an isotope peak containing 13C is erroneously chosen as the monoisotopic 12C peak, the calculated mass of the intact peptide will be incorrect. Since the peptide molecular weight is used as the filter to derive candidate sequences, an incorrect molecular weight will only provide incorrect sequence candidates. If the ion that triggered a data-dependent MS/MS spectrum was not the monoisotopic 12C ion [8] but contained one or more 13C atoms, the intact mass may be outside of the assigned peptide mass tolerance (e.g., Fig. 4). Search engines expect protonated peptide ions, but dirty HPLC solvents might contain sufficient concentrations of alkali metal cations to produce, for example, sodiated peptide ions. Not only will this result in an unexpected peptide mass, but the presence of a sodium ion also significantly changes the fragmentation pattern compared to protonated peptide ions.
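One way to see how a mis-selected isotope peak shifts the peptide mass is to list the neutral masses implied by assuming that the triggering peak contained 0, 1, 2, ... 13C atoms. The sketch below is illustrative only; the function name is hypothetical, and it is not a description of how any particular search engine handles the problem.

```python
PROTON = 1.00728        # mass of a proton in Da
C13_C12_DIFF = 1.00335  # mass difference between 13C and 12C in Da


def candidate_neutral_masses(mz, charge, max_isotope=2):
    """Monoisotopic masses implied if the peak held 0..max_isotope 13C atoms."""
    apparent = charge * (mz - PROTON)
    return [apparent - n * C13_C12_DIFF for n in range(max_isotope + 1)]
```

Using the values quoted in the Fig. 4 caption and assuming a charge of 3 (an assumption made here because the 1.34 m/z spacing over four isotope steps suggests it), `candidate_neutral_masses(801.74, 3, max_isotope=4)` yields an apparent mass about 4 Da too high as its first entry and a value close to the true monoisotopic mass, 3 × (800.4 − 1.00728), as its last. A precursor tolerance of a fraction of a dalton would exclude the correct peptide unless the isotope error is considered.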


For complex samples it is not uncommon to acquire MS/MS spectra containing fragment ions from multiple precursor ions that are close in mass. QTOF and IT instruments are typically operated to allow passage of a 3 or 4 Da window surrounding the precursor ion that was selected by the data system during a data-dependent LC-MS/MS experiment. If other ions of similar mass coelute, then the MS/MS spectrum will contain additional ions that are unrelated to the ion that triggered the scan. Hence, database search programs will not be able to correlate these additional unrelated ions with any single peptide.

The most likely explanation for a spectrum not matching a sequence database entry is that it is of insufficient quality or that the MS/MS spectrum was obtained from non-peptide contamination. If data are collected using dirty solvents or a contaminated column, there will be a number of MS/MS spectra acquired for non-peptide precursor ions. Also, mass spectrometers are sometimes operated with the data-dependent trigger threshold set so low that MS/MS spectra are collected for spurious noise peaks in the MS scan. At times, the data-dependent trigger may occur early in a peptide’s elution when the precursor ion intensity is low, which could result in a poor quality spectrum. Many such spectra cannot be matched simply because of their poor quality.

Upon discovering that a large number of spectra cannot be identified, there may be a temptation for those who are innocent of the nuances described above to jump to the conclusion that their sample contains a novel protein with numerous and unusual modifications. In fact, the aforementioned mundane reasons are far more likely. In our opinion, the best way to deal with these difficulties is to perform a database search using the simplest criteria: strict tryptic cleavage rules, one variable modification, and a single missed tryptic cleavage site. The unmatched spectra can be dealt with by performing second or third searches allowing wider mass tolerances or no enzyme specificity. In addition, a database search can be done using a large number of variable modifications, single point mutations, and nonconsensus protease cleavage. As already noted, running such a search against the usual non-redundant protein sequence database will require considerable processing time and produce large numbers of false positives. However, if the search is limited to those proteins already identified in the first stringent search, then this subsequent search can be relatively quick and the resulting matches more believable [27]. A completely different approach to interpreting unmatched spectra is to attempt to derive a sequence de novo.

Fig. 4. Data-dependent ion selection with dynamic exclusion can result in precursor isolation of a skewed isotope distribution and incorrect mass assignments. The upper MS/MS spectrum was triggered by the 12C ion (shown in inset as 800.4) and the precursor ion selection window transmitted all of the isotope peaks at higher mass. Following acquisition of this spectrum, the data system excluded acquisition of other precursors within 1.3 Da of 800.4. However, the isotope peak at 801.74 was just outside this exclusion zone, and an MS/MS spectrum was triggered by this fourth 13C isotopic peak (see inset in upper panel). However, the precursor selection window was biased against transmitting ions below the triggering mass, and the peaks transmitted (inset, lower panel) all contained two or more 13C atoms. Note the apparent increase in mass of the higher mass fragment ions by one mass unit (at m/z >324), which is due to the fact that higher molecular weight fragments are more likely to contain at least one of the 13C atoms. The bottom spectrum will have an incorrect peptide molecular weight, plus the higher mass fragment ion masses will be skewed. The spectra were acquired on a QTOF.

7. De novo sequencing and homology searches

De novo sequencing is the term used for the process of deriving peptide sequences from MS/MS spectra without using a sequence database. One can manually perform this task (for a tutorial, see www.abrf.org/ResearchGroups/MassSpectrometry/EPosters/ms97quiz/abrfQuiz.html), or use software to automate it. Automated de novo sequencing has a history going back at least to the mid-1960s, when a room filled with an IBM 7094 computer was used to sequence polyamino alcohol peptide derivatives [28]. Since then, computers have sped up somewhat, and it is no longer necessary to derivatize peptides in order to ionize and volatilize them. Early programs were written to sequence peptides using data obtained from chemical ionization MS [29], FAB-MS [30,31], and FAB-MS/MS [32–34]. At this point in time, most MALDI-MS and ESI-MS instrument vendors offer de novo sequencing programs as part of their software packages. Other commercially available programs are vendor-independent [35,36], and one is open source [37]. The details of how these programs work are all different, and not particularly relevant for this review. Suffice it to say that they all attempt to derive partial or complete peptide sequences without recourse to a sequence database.
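Although the cited programs differ in their details, the elementary step they share with manual interpretation is reading residues from the mass differences between consecutive ions of one series. The sketch below is a toy illustration of that step only, not of any cited program: the function name is hypothetical, the input is assumed to be a clean, complete, singly charged ion series, and residues that cannot be distinguished within the tolerance are reported together (e.g., "L/I").

```python
# Monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def read_ladder(ion_mz, tol=0.02):
    """Assign residues to successive mass differences in one ion series."""
    ions = sorted(ion_mz)
    calls = []
    for low, high in zip(ions, ions[1:]):
        delta = high - low
        matches = [aa for aa, mass in RESIDUE_MASS.items()
                   if abs(mass - delta) <= tol]
        # Unexplained gaps are reported as the bare mass difference.
        calls.append("/".join(matches) if matches else f"[{delta:.2f}]")
    return calls
```

Applied to the calculated y-ion series of WELALGR, `read_ladder([175.119, 232.140, 345.224, 416.262, 529.346, 658.388, 844.468])` gives `['G', 'L/I', 'A', 'L/I', 'E', 'W']`, which read backwards (y-ions grow toward the N-terminus) is W-E-[L/I]-A-[L/I]-G, with the C-terminal arginine implied by the y1 ion at 175.119. Widening `tol` to 0.05 would make glutamine and lysine indistinguishable as well.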

De novo sequencing is a fairly difficult and error-prone process that typically produces ambiguous results (Table 2). The first three problems enumerated in Table 2 relate to the difficulties of differentiating between amino acids that have identical, or nearly identical, masses. Distinguishing leucine from isoleucine is not possible using the low energy collision induced dissociation (CID) methods of the most commonly used instruments (ion traps and quadrupole collision cells); however, at higher collision energy (>500 eV) side chain fragmentations occur that permit this distinction [38]. The second and third problems listed in Table 2 (i.e., certain amino acids are very close in mass) can be overcome if the mass analyzer is capable of sufficiently high mass accuracy and resolution (TOF and FTMS).

Table 2
Why de novo sequencing is difficult
1. Leucine and isoleucine have the same mass
2. Glutamine and lysine differ in mass by 0.036 Da
3. Phenylalanine and oxidized methionine differ in mass by 0.033 Da
4. Cleavages do not occur at every peptide bond
   a. Poor quality spectrum (some fragment ions are below noise level)
   b. The C-terminal side of proline is often resistant to cleavage
   c. Absence of mobile protons
   d. Peptides with free N-termini often lack fragmentation between the first and second amino acids
5. Certain amino acids have the same mass as pairs of other amino acids
   a. Gly + Gly (114.0429) = Asn (114.0429)
   b. Ala + Gly (128.0586) = Gln (128.0586)
   c. Ala + Gly (128.0586) ≈ Lys (128.0950)
   d. Gly + Val (156.0899) ≈ Arg (156.1011)
   e. Ala + Asp (186.0641) ≈ Trp (186.0793)
   f. Ser + Val (186.1005) ≈ Trp (186.0793)
6. Directionality of an ion series is not always known (are they b- or y-ions?)
Numbers in parentheses are residue masses.

To derive a complete sequence by MS/MS requires that a fragment ion be observed for each peptide bond, and the fact that this rarely happens (problem 4 in Table 2) is a major reason for the difficulty of de novo sequencing. If it were known at the outset that an MS/MS spectrum exhibited a complete series of ions separated by single amino acid residue masses, deriving a sequence would be trivial. Sometimes this is the case (Fig. 5A), and sometimes not (Fig. 5B), and there are several reasons that account for the lack of fragment ions. A poor quality spectrum may contain fragment ion signals that drop below the measurement noise level. Cleavage on the C-terminal side of proline has long been known to be an unfavorable reaction, and is often not observed. For low energy CID the primary sequence-specific fragment ions are the y- and b-type ions, which are most abundant when there is a mobile proton to induce cleavage of individual peptide bonds [39]. The presence of a mobile proton is a likely contributor to the formation of a complete y-ion series in Fig. 5A; in contrast, Fig. 5B depicts an MS/MS spectrum of a peptide ion where the two arginine residues likely sequestered the two ionizing protons, leaving no mobile proton. The resulting fragmentation pattern is a noncontiguous mixture of neutral losses of ammonia from a-, b-, and y-ions, plus some rearrangements resulting in “b + 18” ions [40]. Finally, one rarely finds fragment ions that allow for the sequencing of the two N-terminal amino acids from peptides that have free N-terminal amino groups. To make it clear that sequencing evidence is absent, some programs (e.g., Lutefisk [37]) denote these unfragmented regions by using a bracketed mass value within a sequence. For example, the notation “QTNVL[200]YGGLTN” (Fig. 6) indicates that the N-terminal sequence QTNVL is followed by two or more amino acids in the middle of the peptide whose combined residue mass is 200 Da, which is followed by the sequence YGGLTN at the C-terminus.

Fig. 5. The position and number of basic residues can affect spectrum quality. (A) MS/MS spectrum of a tryptic peptide ion containing one arginine residue and two protons. The mobile proton produces an easily sequenced contiguous series of y-type ions. (B) MS/MS spectrum of a non-tryptic peptide containing two arginine residues and two protons. The lack of a mobile proton makes this spectrum difficult to interpret. The spectra were acquired on a Q-TOF mass spectrometer (Micromass, UK).

Problem 5 in Table 2 lists the pairs of amino acids whose combined masses are identical, or nearly identical, to the masses of certain single residues. If a peptide actually contained the subsequence Gly-Gly, but there were no fragment ions present to make this distinction, one would reasonably (but erroneously) conclude that there was an Asn at that position. Conversely, if the peptide actually contained an Asn within its sequence, but an unrelated ion happened to be at a mass corresponding to cleavage of a Gly-Gly subsequence, then the opposite error would be made. There are no good ways to resolve this except to account for these possibilities later on when the de novo sequencing results are used for homology-based database searches (see below).

Fig. 6. CIDentify is able to align these query and database sequences because of mass equivalencies. Gln in the query sequence has the same mass as the combined mass of Gly and Ala in the database sequence. Leucine and isoleucine are indistinguishable, and are counted as exact matches. The unsequenced mass of 200 Da in the query sequence matches the combined mass of Thr and Val in the database sequence. The combined mass of two Gly in the query equals that of a single Asn in the database. The dipeptide Thr-Asn has the same mass as Ser-Lys.

The final problem listed in Table 2 relates to the fact that it is usually not known whether an ion contains the C- or N-terminus (i.e., whether the sequence is being read off from a b- or y-type ion series). Hence, it is possible to derive a sequence that is mostly correct, but backwards. In the absence of proteolytic ¹⁸O isotopic labeling of the C-terminus using 50% H₂¹⁸O [41], which labels y-type ions as doublets separated by 2 Da, one can only attempt to connect an ion series to one of the termini. For example, tryptic peptides usually contain a C-terminal lysine or arginine, which produce y1 ions at m/z 147 or 175, respectively. If a series includes this ion then it is usually safe to conclude that it is a y-type ion series. If this cannot be done then one simply has to guess.

So why bother with manual or automated de novo sequencing when the aforementioned robust database search engines are available to interpret peptide MS/MS spectra? First, high quality spectra can be rapidly identified and sorted away from the junk spectra. It stands to reason that if an automated de novo sequencing program can derive a high scoring sequence (i.e., one that accounts for most of the fragment ions and has a contiguous series of ions defining an amino acid sequence), then that spectrum contains useful data, whether or not a database match can be made. Second, a de novo result can be used to help validate a database search result [42]. These two approaches (database search and de novo sequencing) are so different that any similarity in the results could be taken as evidence that the database-derived sequence is correct. Third, de novo sequencing results can be used for homology-based searches [37], which can help make cross-species identifications [43] or, in some cases, identify peptide modifications [44,45]. In short, de novo sequencing can help validate database search results, as well as sort through the myriad of unmatched spectra.
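The mass coincidences behind problem 5 of Table 2, and the y1 ions used to assign directionality, follow from simple residue-mass arithmetic. The sketch below is illustrative only: the monoisotopic residue masses, the water and proton masses, and the function names are supplied here as assumptions rather than taken from any of the programs discussed.

```python
from itertools import combinations_with_replacement

# Commonly tabulated monoisotopic residue masses in Da (assumed values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # Da, added to the C-terminal fragment
PROTON = 1.00728   # Da, charge carrier for a singly charged ion


def pair_equivalents(tolerance):
    """List (pair, single residue, mass difference) where the summed mass of
    two residues lies within `tolerance` Da of a single residue mass."""
    hits = []
    for a, b in combinations_with_replacement(sorted(RESIDUE_MASS), 2):
        pair_mass = RESIDUE_MASS[a] + RESIDUE_MASS[b]
        for single, mass in RESIDUE_MASS.items():
            if abs(pair_mass - mass) <= tolerance:
                hits.append((a + b, single, round(pair_mass - mass, 4)))
    return hits


def y1_mz(c_terminal_residue):
    """m/z of the singly charged y1 ion for a given C-terminal residue."""
    return RESIDUE_MASS[c_terminal_residue] + WATER + PROTON


for pair, single, delta in pair_equivalents(0.05):
    print(f"{pair} vs {single}: difference {delta:+.4f} Da")
print(f"y1(Lys) = {y1_mz('K'):.2f}, y1(Arg) = {y1_mz('R'):.2f}")
```

Tightening the tolerance shows which ambiguities a more accurate mass analyzer can remove, while the exact coincidences (Gly + Gly versus Asn, Ala + Gly versus Gln) remain at any tolerance.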

Use of de novo sequencing results for homology-based database searches is not as straightforward as, for example, using Edman sequencing results. As noted in Table 2, there is a high probability that any sequence derived from MS/MS spectra will contain some ambiguity. Furthermore, several high scoring sequence candidates are likely to be derived from a single MS/MS spectrum. Some of this ambiguity is easily resolved by altering the matrix scoring tables to reflect the fact that for particular MS/MS data there is no chance of differentiating Leu/Ile or Gln/Lys. For example, the average of the scores for Gln identity and Lys identity used in a standard BLOSUM [46] or PAM [47] matrix would replace the Gln versus Lys homology score. The standard homology search programs, BLAST [48] and FASTA [49], are not capable of effectively handling other tandem mass spectrometric ambiguities (Fig. 6). Consequently, the available FASTA code was altered to accommodate some of the issues outlined in Table 2 and Fig. 6. First, each candidate sequence (typically limited to five or fewer) is compared to each database sequence entry. The best initial alignment between query and database sequence is then further evaluated to see whether any dipeptide mass matches a single amino acid mass, whether a bracketed mass in the query corresponds to the combined mass of amino acids in the database sequence, or whether any dipeptide in the query has the same mass as a dipeptide in the database sequence. This modification of FASTA was called CIDentify and the source code is available from the University of Virginia FASTA ftp site (http://ftp.virginia.edu/pub/fasta/). A similarly “mass-informed” homology search program is OpenSea [45]. Other groups have utilized the BLAST program, unmodified for mass ambiguities, to analyze mass spectral de novo sequencing results by adjusting certain parameters and the matrix scoring tables.
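The scoring-table adjustment described above for indistinguishable residues can be written out as a small calculation. This is a sketch under stated assumptions: the dictionary holds only a six-entry excerpt of BLOSUM62 scores typed in for illustration, and the function names are hypothetical rather than part of FASTA, BLAST, or CIDentify.

```python
# Excerpt of BLOSUM62 scores (assumed values; only the entries needed here).
BLOSUM62_EXCERPT = {
    ("L", "L"): 4, ("I", "I"): 4, ("L", "I"): 2,
    ("Q", "Q"): 5, ("K", "K"): 5, ("Q", "K"): 1,
}


def score(matrix, a, b):
    """Look up a substitution score, treating the matrix as symmetric."""
    return matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]


def merge_indistinguishable(matrix, a, b):
    """Return a copy of the matrix in which the a-versus-b score is replaced
    by the average of the a identity and b identity scores."""
    adjusted = dict(matrix)
    average = (score(matrix, a, a) + score(matrix, b, b)) / 2
    adjusted.pop((b, a), None)  # drop a mirrored entry if one exists
    adjusted[(a, b)] = average
    return adjusted


# Leu/Ile can never be told apart by low energy CID; Gln/Lys only when the
# mass accuracy is too low to resolve 0.036 Da.
ms_matrix = merge_indistinguishable(BLOSUM62_EXCERPT, "L", "I")
ms_matrix = merge_indistinguishable(ms_matrix, "Q", "K")
```

With these excerpted values, a Gln aligned against a Lys scores 5 rather than 1, so a correct but ambiguous de novo residue is no longer penalized as a substitution.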
Proteins associated with the 20S proteasome from an unsequenced trypanosome species were separated by 2D gel electrophoresis, and the peptides from each isolated spot were sequenced by tandem mass spectrometry [50]. Convincing identifications were made by taking the sequences (total number = n) obtained from each spot (assumed to be from the same protein), and concatenating them into the various n! permutations, where each permutation was individually submitted to the gapped BLAST program. This approach assumes that each peptide sequence is mostly correct (i.e., there were few errors in the de novo sequencing), and that the peptides were all from the same protein. An implementation called MS-BLAST [51] uses as input a list of candidate sequences for each MS/MS spectrum all linked together, where each sequence is separated by a gap symbol. The sequence candidates produced by de novo sequencing software can be very similar, so to avoid having multiple similar sequence candidates contribute to the total alignment score, the parameter “-span1” is employed to ensure that only the best alignment score is included (rather than all of them). Other parameters need to be selected in the MS-BLAST implementation, and have been described elsewhere [51]. A website, utilizing the appropriate parameters, is also available (http://dove.embl-heidelberg.de/Blast2/msblast.html). The program FASTS is similar in that it can use multiple short sequences as input [52]. The source code (http://ftp.virginia.edu/pub/fasta/) and a web interface (http://fasta.bioch.virginia.edu/) are available. MS-BLAST and FASTS can partially account for the problem of mass equivalencies (Fig. 6) by adding more query sequences (e.g., there could be two identical queries, except where GG replaces N).
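The query-building steps described above (adding mass-equivalent variants, linking candidates with a gap symbol, and concatenating peptides in every order) can be sketched as follows. The helper names are hypothetical, the "-" gap character is an assumption made for illustration, and only the Asn and Gln equivalences from Table 2 are expanded.

```python
from itertools import permutations, product

# Single residues and the isobaric dipeptides that may replace them
# (Gly + Gly = Asn, Ala + Gly = Gln; see Table 2).
EQUIVALENTS = {"N": ["N", "GG"], "Q": ["Q", "AG", "GA"]}


def expand_equivalents(candidate, limit=50):
    """Return the candidate plus variants in which N or Q is replaced by an
    isobaric dipeptide, capped at `limit` sequences."""
    options = [EQUIVALENTS.get(residue, [residue]) for residue in candidate]
    variants = []
    for combo in product(*options):
        variants.append("".join(combo))
        if len(variants) >= limit:
            break
    return variants


def linked_query(candidates, gap="-"):
    """Join candidate sequences into one query separated by a gap symbol."""
    return gap.join(candidates)


def concatenation_orders(peptides):
    """All n! concatenations of n peptide sequences."""
    return ["".join(order) for order in permutations(peptides)]


query = linked_query(expand_equivalents("YGNLT"))   # "YGNLT-YGGGLT"
orders = concatenation_orders(["AAK", "CCR", "DDK"])  # 6 sequences
```

The cap in `expand_equivalents` is a practical choice: the number of variants grows multiplicatively with each ambiguous residue, and n! grows quickly with the number of peptides per spot.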

8. Validation of search results

Whenever database searches are performed using MS/MS spectra, it is important to remember that the results obtained are not necessarily correct. Regardless of the specific scoring scheme used, each search engine will rank sequence candidates according to an assigned score. Given that for each MS/MS spectrum there will be at least one top ranked sequence, it still remains for users to verify that identifications are correct. In its crudest form, this is accomplished using simple threshold scores to discern correct from incorrect. Typically, these threshold values are obtained from publications coming from respected laboratories whose members chose values for unspecified reasons [53]. A more refined approach is to run a second database search using a randomized database of the same size and amino acid composition, which is most easily done by reversing the original database [54,55]. The results obtained from the random or reversed database would presumably be incorrect (although this has to be verified) and the highest scores obtained would represent a threshold level above which one would anticipate few false positives. If a database search was performed using a larger multi-species database, a score threshold level might be discerned from high scoring matches to proteins specific to species that are known to be irrelevant. For example, when identifying proteins from a human source, high scoring matches to viral proteins might indicate that the threshold level is too low. Care should be taken with this approach, since we have had experiences where cell culture contaminations resulted in the presence of proteins from unexpected species. The problem with thresholds for delineating correct from incorrect peptide assignments is that no scoring scheme is perfect. For any given search engine and threshold score that is applied to a set of MS/MS spectra there will always be a certain number of false positives and false negatives.
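The reversed-database approach to choosing a threshold can be sketched in a few lines. This is a minimal illustration with hypothetical function names; the scores passed in are assumed to come from whatever search engine is in use.

```python
def reverse_database(entries):
    """Build a decoy database by reversing every sequence, which preserves
    the size and amino acid composition of the original."""
    return {"REV_" + accession: sequence[::-1]
            for accession, sequence in entries.items()}


def decoy_threshold(decoy_scores):
    """Highest score obtained against the decoy database; matches to the
    real database should exceed this value."""
    return max(decoy_scores)


def accept(target_matches, threshold):
    """Keep (peptide, score) pairs that score above the decoy threshold."""
    return [(peptide, s) for peptide, s in target_matches if s > threshold]


# Hypothetical example: two database entries and made-up search scores.
database = {"P1": "MKWVTFISLLR", "P2": "LALDIEIATYR"}
decoys = reverse_database(database)
threshold = decoy_threshold([1.4, 2.1, 1.8])
kept = accept([("LALDIEIATYR", 3.2), ("KECEELEEK", 1.9)], threshold)
```

Using the single highest decoy score is the strictest reading of the approach; with large data sets one might instead tolerate a small, stated fraction of decoy matches above the threshold.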
A more desirable threshold would be probabilistic, where a certain level of error is defined as being acceptable. In some cases, the search engine provides a probability estimate for each peptide match, which is usually given as the probability that the peptide score could be obtained randomly (i.e., the probability that the null hypothesis is correct) [12,13]. Even if the search engine score is not directly converted to a probability estimate (e.g., Sequest cross-correlation scores [14]), methods have been described that would allow calculation of expectation values [56], or similar statistical estimates of accuracy [54,57]. At the moment, this is an active area of investigation, and while no consensus has developed as to which approach is best, our laboratory currently prefers calculation of expectation values for individual MS/MS spectra.

A separate problem is validating protein identifications. Here again, statistical methods are being developed [56,58], and there is no clear winner as to which approach is best. Practically speaking, in the near future people will continue to use what they are accustomed to using (which is often whatever is supplied by the instrument vendor); however, there are a number of simple rules and guidelines to follow that should eliminate most identification errors. Of primary importance is the fact that identifications are more likely to be correct if more than one peptide matches a given protein. The degree to which ambiguity is reduced is dependent on the total number of MS/MS spectra used in a given search, and also on the size of the sequence database that was searched. The odds of accidentally matching two spectra to a single database entry go up when very large numbers of MS/MS spectra are used to search small sequence databases. Furthermore, unusually large database entries are more likely to match more than one spectrum by chance.
For example, the protein titin contains about 35,000 residues, and therefore has about a 100-fold increased chance of randomly matching a spectrum compared to an average protein with 350 amino acids. In other words, the protein sequence coverage is an important factor when validating search results. Despite all of these caveats, a good rule of thumb is that matching more than one spectrum to a single protein is a good thing. Conversely, a single MS/MS spectrum matching to a particular database entry requires an extraordinary level of proof if it is to be believed.

Another key validation tool is to verify that the peptides identified could be derived from the consensus cleavage of the protease that was employed. Typically trypsin is used, which cleaves on the C-terminal side of arginine or lysine, unless followed by proline. Missed cleavages can occur, particularly if the digestion was not done to completion. Otherwise, missed cleavages are usually due to the presence of a number of nearby acidic groups or are due to cleavage sites being adjacent to one another. Trypsin is not an exopeptidase, so once a cleavage occurs at one of the adjacent sites, it will not cut at the other. Hence, if a database search identifies a sequence candidate that contains trypsin cleavage sites located in the middle of the peptide, and the surrounding residues are not acidic, then in the absence of other supporting information that sequence identification might be considered suspect (assuming the digestion went to completion). A more frequent problem is that a researcher will perform a search without requiring that the two ends of the peptide result from a tryptic cleavage. Although one might occasionally find a peptide containing only a single consensus cleavage, it is a very rare event for a tryptic peptide to result from two non-tryptic cleavages [59].

Other common sense rules can be applied to the validation of protein identifications. For example, identifications based on a single peptide are less probable if the peptide is short (<10 amino acids). Alkylation of cysteine is a very effective reaction that will in most cases go to completion; therefore, one should expect that peptides derived from alkylated samples should contain alkylated cysteine. Sometimes a putative peptide assignment can be rejected based on a mismatch between amino acid composition and chromatographic retention time. For example, very basic peptides would not be expected to elute early from a strong cation exchange column [55]. Likewise, hydrophobic peptides would not be expected to elute early from a reversed phase column. In cases where it is important to get it right, synthesizing the putative peptide, and subjecting it to the same LC-MS/MS conditions as the sample will provide a satisfactory level of validation. If the synthesized peptide matches the elution time and fragmentation pattern of the peptide obtained from a sample then this provides very good evidence that the assignment is correct.
Of course, this approach adds time (~2 weeks) and cost (~$300) per peptide, and would only be used when this effort saved time and expense by helping to avoid erroneous hypotheses and wasted work later on.

The serum and plasma proteomics literature is rich in examples demonstrating many of these validation principles. A recent compilation of serum results obtained from earlier “non-proteomic” literature, a 2D gel project [60], and two LC-MS/MS analyses [61,62] showed that of 1175 gene products identified from all four sources, only 195 were reported more than once [63]. For some, this difference suggests that the serum proteome is a rich source of a large variety of rare and therapeutically diagnostic protein signals [64]. To us, this would suggest that a number of protein identification mistakes have been made. For example, in one of the human serum proteome studies [62], the authors identified several common proteins as well as some low abundance cytokines and growth factors. As expected, the high abundance proteins were identified from peptides mostly derived from one or two tryptic consensus cleavages. Only 3 out of 112 albumin peptides were completely non-tryptic; likewise, none of the 18 haptoglobin, 22 trypsin inhibitor, 15 α-1 glycoprotein, 26 apolipoprotein, or 51 C4 complement peptides were completely non-tryptic. Curiously, the low abundance cytokines or growth hormones (e.g., atrial natriuretic factor, human growth hormone, inhibin, IL-12a, interferon, FGF-12, PSA, growth/differentiation factor 5) were each identified on the basis of single peptides, all of which had completely non-tryptic termini. Biochemically, it is unexpected that contaminating proteases (those responsible for the non-tryptic cleavages) would favor cleavage of substrates at lower concentrations (cytokines and growth factors), but leave the high concentration substrates virtually untouched. Furthermore, the fact that these proteins were identified on the basis of a single peptide should indicate that these extraordinary claims require extraordinary proof (see above). Although it is not possible to prove that all of these assignments are incorrect, it should be noted that one of them, a growth factor (Accession No. gi|494196), is a mutant that was produced in an Escherichia coli expression system. It is unlikely that the person from whom the serum sample was taken had been septic from this particular laboratory mutant.

Other examples demonstrating questionable validation of protein identification can be found in work describing the “low molecular weight serum proteome” [61]. In this work, a database search using MS/MS spectra of human serum proteins was performed using a database containing both human and viral sequences. As noted by the authors, any matches to viral proteins would almost certainly be incorrect. However, rather than adjusting the score threshold upwards to systematically eliminate all false positives, these authors appear to have left their threshold level unchanged and simply deleted viral proteins from their final list of identified proteins.
Thus, the obvious potential false positives were removed, because they were annotated as being viral, but the less obvious false positives, annotated as human, were not. This method of validation, as described, is not recommended. Using this approach, a number of low abundance proteins (defined as <10 pg/ml in serum) were reported to have been identified: angiotensinogen precursor, the Kangai-1 antigen, interleukin 15, and leukemia inhibitory factor. According to the Expasy website (http://us.expasy.org/cgi-bin/nice2dpage.pl?P01019), angiotensinogen precursor is present in serum at a concentration of 30–60 µg/ml, so this would not be considered to be a particularly rare protein. The Kangai-1 antigen protein was identified on the basis of a single peptide (QTSSSSLRMGAYVFI), where both ends are non-tryptic, yet a readily cleavable tryptic site is located in the middle. Interleukin 15 was identified on the basis of a single very short peptide containing many acidic residues (KECEELEEK) that is non-tryptic on the N-terminus, and has an unmodified cysteine despite having been derived from a serum sample that had been reduced and carbamidomethylated. Furthermore, their identification of interleukin 15 was based on a cross-correlation score of 1.9, which is below their stated threshold value of 2.2, so it is not clear why this was listed as having been positively identified. The leukemia inhibitory factor is slightly more plausible, having been identified on the basis of a single longer peptide (DVTYGPDTSGKDVFQK) that is non-tryptic at only the N-terminus, but has an internal tryptic cleavage site, albeit adjacent to an acidic residue.

One final example will show the importance of being careful when validating protein identifications. In a paper describing the sheep lymph proteome [65], a paragraph is devoted to a discussion of glial fibrillary acidic protein that was identified on the basis of a single fully tryptic peptide (LALDIEIATYR). The authors did not appear to have considered the possibility that this same peptide is also found in a variety of keratins, and that a far more likely explanation for this peptide would be that it was due to keratin contamination of the lymph sample. So in this case, the peptide may have been sequenced correctly, but the protein could not be definitively proven (i.e., was it sheep glial fibrillary protein or just human keratin?). More data were needed to be certain.
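Several of the checks applied to the peptides above (C-terminal residue, internal cleavage sites, peptide length) are mechanical and can be sketched as follows. The function names and flag wording are hypothetical. The sketch works from the peptide sequence alone, so it cannot judge the N-terminal cleavage, which requires the preceding residue in the protein, and it does not exempt peptides from the protein C-terminus.

```python
def internal_tryptic_sites(peptide):
    """1-based positions of Lys or Arg not followed by Pro, excluding the
    C-terminal residue (i.e., sites trypsin could still have cut)."""
    return [i + 1 for i in range(len(peptide) - 1)
            if peptide[i] in "KR" and peptide[i + 1] != "P"]


def c_terminus_is_tryptic(peptide):
    """True if the peptide ends in Lys or Arg."""
    return peptide[-1] in "KR"


def review_flags(peptide, min_length=10):
    """Collect reasons to look more closely at a single-peptide match."""
    flags = []
    if not c_terminus_is_tryptic(peptide):
        flags.append("non-tryptic C-terminus")
    if internal_tryptic_sites(peptide):
        flags.append("internal cleavage site")
    if len(peptide) < min_length:
        flags.append("short peptide")
    return flags


# Peptides quoted in the text above.
for peptide in ("QTSSSSLRMGAYVFI", "KECEELEEK", "DVTYGPDTSGKDVFQK", "LALDIEIATYR"):
    print(peptide, review_flags(peptide))
```

A peptide with no flags, such as LALDIEIATYR, can still be assigned to the wrong protein, so checks of this kind only complement the broader validation steps discussed in this section.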

9. Conclusion

There are a number of informatics solutions available for performing mass mapping and database searches using MS/MS spectra. For more difficult problems (e.g., working with organisms that are poorly represented in sequence databases), a few options exist for performing automated de novo sequencing followed by homology-based database searches. Despite such efforts there will always be a large fraction of MS/MS spectra for which no matches or identifications are possible. This is typical, and one should resist the temptation to assume that there is gold in these piles of unidentified spectra. Automated de novo sequencing programs can be used to flag the higher quality unmatched spectra; otherwise, expedience and caution force the conclusion that the unsequenced MS/MS spectra are noise. While database searches are easily performed, there are currently no standards with respect to the validation of protein identifications, and the degree of skepticism exhibited by individuals is widely variable.

References

[1] B.W. Gibson, K. Biemann, Proc. Natl. Acad. Sci. USA 81 (1984) 1956–1960.
[2] M. Mann, P. Hojrup, P. Roepstorff, Biol. Mass Spectrom. 22 (1993) 338–345.
[3] D.J.C. Pappin, P. Hojrup, A.J. Bleasby, Curr. Biol. 3 (1993) 327–332.
[4] P. James, M. Quadroni, E. Carafoli, G. Gonnet, Biochem. Biophys. Res. Commun. 195 (1993) 58–64.
[5] J.R. Yates III, S. Speicher, P.R. Griffin, T. Hunkapiller, Anal. Biochem. 214 (1993) 397–408.
[6] W.J. Henzel, T.M. Billeci, J.T. Stults, S.C. Wong, C. Grimley, C. Watanabe, Proc. Natl. Acad. Sci. USA 90 (1993) 5011–5015.
[7] K. Clauser, P. Baker, A. Burlingame, Anal. Chem. 71 (1999) 2871–2882.
[8] S.A. Carr, A.L. Burlingame, M.A. Baldwin, in: A.L. Burlingame, S.A. Carr, M.A. Baldwin (Eds.), Mass Spectrometry in Biology and Medicine, Humana, Totowa, NJ, 2000, pp. 553–561.
[9] M. Mann, M. Wilm, Anal. Chem. 66 (1994) 4390–4399.
[10] D.C. Stahl, K.M. Swiderek, M.T. Davis, T.D. Lee, J. Am. Soc. Mass Spectrom. 7 (1996) 532–540.
[11] D.L. Tabb, A. Saraf, J.R. Yates III, Anal. Chem. 75 (2003) 6415–6421.
[12] H.I. Field, D. Fenyo, R.C. Beavis, Proteomics 2 (2002) 36–47.
[13] D. Perkins, D. Pappin, D. Creasy, J. Cottrell, Electrophoresis 20 (1999) 3551–3567.
[14] J.K. Eng, A.L. McCormack, J.R. Yates III, J. Am. Soc. Mass Spectrom. 5 (1994) 976–989.
[15] L.Y. Geer, S.P. Markey, J.A. Kowalak, L. Wagner, M. Xu, D.M. Maynard, X. Yang, W. Shi, S.H. Bryant, J. Proteome Res. 3 (2004) 958–964.
[16] R. Craig, R.C. Beavis, Rapid Commun. Mass Spectrom. 17 (2003) 2310–2316.
[17] I.A. Papayannopoulos, Mass Spectrom. Rev. 14 (1995) 49–73.
[18] P.J. Kersey, J. Duarte, A. Williams, Y. Karavidopoulou, E. Birney, R. Apweiler, Proteomics 4 (2004) 1985–1988.
[19] J.S. Choudhary, W.P. Blackstock, D.M. Creasy, J.S. Cottrell, Proteomics 1 (2001) 651–667.
[20] R. Moore, M. Young, T. Lee, J. Am. Soc. Mass Spectrom. 11 (2000) 422–426.
[21] J.R. Yates III, S.F. Morgan, C.L. Gatlin, P.R. Griffin, J.K. Eng, Anal. Chem. 70 (1998) 3557–3565.
[22] S.E. Stein, D.R. Scott, J. Am. Soc. Mass Spectrom. 5 (1994) 859–866.
[23] D.L. Tabb, M.J. MacCoss, C.C. Wu, S.D. Anderson, J.R. Yates III, Anal. Chem. 75 (2003) 2470–2477.
[24] I. Beer, E. Barnea, T. Ziv, A. Admon, Proteomics 4 (2004) 950–960.
[25] M.T. Davis, C.S. Spahr, M.D. McGinley, J.H. Robinson, E.J. Bures, J. Beierle, J. Mort, W. Yu, R. Luethy, S.D. Patterson, Proteomics 1 (2001) 108–117.
[26] R.S. Johnson, K. Biemann, Biochemistry 26 (1987) 1209–1214.
[27] D.M. Creasy, J.S. Cottrell, Proteomics 2 (2002) 1426–1434.
[28] K. Biemann, C. Cone, B.R. Webster, G.P. Arsenault, J. Am. Chem. Soc. 88 (1966) 5598–5606.
[29] A.A. Kiryushkin, H.M. Fales, T. Axenrod, E.J. Gilbert, G.W.A. Milne, Org. Mass Spectrom. 5 (1971) 19–31.
[30] K. Ishikawa, Y. Niwa, Biomed. Environ. Mass Spectrom. 13 (1986) 373–380.
[31] T. Sakurai, T. Matsuo, H. Matsuda, I. Katakuse, Biomed. Mass Spectrom. 11 (1984) 396–399.
[32] J. Fernandez-de-Cossio, J. Gonzalez, L. Betancourt, V. Besada, G. Padron, Y. Shimonishi, T. Takao, Rapid Commun. Mass Spectrom. 12 (1998).
[33] R.S. Johnson, K. Biemann, Biomed. Environ. Mass Spectrom. 18 (1989) 945–957.
[34] W.M. Hines, A.M. Falick, A.L. Burlingame, B.W. Gibson, J. Am. Soc. Mass Spectrom. 3 (1992) 326–336.
[35] B. Ma, K. Zhang, C. Hendrie, C. Liang, M. Li, A. Doherty-Kirby, G. Lajoie, Rapid Commun. Mass Spectrom. 17 (2003) 2337–2342.
[36] D. Dancik, T. Addona, K. Clauser, J. Vath, P. Pevzner, J. Comput. Biol. 6 (1999) 327–342.
[37] J.A. Taylor, R.S. Johnson, Rapid Commun. Mass Spectrom. 11 (1997) 1067–1075.
[38] R.S. Johnson, S.A. Martin, K. Biemann, J.T. Stults, J.T. Watson, Anal. Chem. 59 (1987) 2621–2625.
[39] V. Wysocki, G. Tsaprailis, L. Smith, L. Breci, J. Mass Spectrom. 35 (2000) 1399–1406.
[40] K.D. Ballard, S.J. Gaskell, J. Am. Chem. Soc. 114 (1992) 64–71.
[41] A. Shevchenko, I. Chernushevich, W. Ens, K. Standing, B. Thomson, M. Wilm, M. Mann, Rapid Commun. Mass Spectrom. 11 (1997) 1015–1024.
[42] J.A. Taylor, R.S. Johnson, Anal. Chem. 73 (2001) 2594–2604.
[43] A.J. Liska, A. Shevchenko, Proteomics 3 (2003) 19–28.
[44] J.A. Taylor, R.S. Johnson, Anal. Chem. 73 (2001) 2594–2604.
[45] B.C. Searle, S. Dasari, M. Turner, A.P. Reddy, D. Choi, P.A. Wilmarth, A.L. McCormack, L.L. David, S.R. Nagalla, Anal. Chem. 76 (2004) 2220–2230.
[46] S. Henikoff, J.G. Henikoff, Proc. Natl. Acad. Sci. USA 89 (1992) 10915–10919.
[47] M.O. Dayhoff, R.M. Schwartz, B.C. Orcutt, in: M.D. Dayhoff (Ed.), Atlas of Protein Sequence and Structure, vol. 5, National Biomedical Research Foundation, Washington, DC, 1978, pp. 345–352.
[48] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, D.J. Lipman, J. Mol. Biol. 215 (1990) 403–410.
[49] W.R. Pearson, D.J. Lipman, Proc. Natl. Acad. Sci. USA 85 (1988) 2444–2448.
[50] L. Huang, R.J. Jacob, S.C.H. Pegg, M.A. Baldwin, C.C. Wang, A.L. Burlingame, P.C. Babbitt, J. Biol. Chem. 276 (2001) 28327–28339.
[51] A. Shevchenko, S. Sunyaev, A. Loboda, A. Shevchenko, P. Bork, W. Ens, K.G. Standing, Anal. Chem. 73 (2001) 1917–1926.
[52] A.J. Mackey, T.A.J. Haystead, W.R. Pearson, Mol. Cell. Proteomics 1 (2002) 139–147.
[53] M.P. Washburn, D. Wolters, J.R. Yates III, Nature Biotechnol. 19 (2001) 242–247.
[54] R.E. Moore, M.K. Young, T.D. Lee, J. Am. Soc. Mass Spectrom. 13 (2002) 378–386.
[55] K.A. Resing, K. Meyer-Arendt, A.M. Mendoza, L.D. Aveline-Wolf, K.R. Jonscher, K.G. Pierce, W.M. Old, H.T. Cheung, S. Russell, J.L. Wattawa, G.R. Goehle, R.D. Knight, N.G. Ahn, Anal. Chem. 76 (2004) 3556–3568.
[56] D. Fenyo, R.C. Beavis, Anal. Chem. 75 (2003) 768–774.
[57] A. Keller, A.I. Nesvizhskii, E. Kolker, R. Aebersold, Anal. Chem. 74 (2002) 5383–5392.
[58] A.I. Nesvizhskii, A. Keller, E. Kolker, R. Aebersold, Anal. Chem. 75 (2003) 4646–4658.
[59] J.V. Olsen, S.E. Ong, M. Mann, Mol. Cell. Proteomics 3 (2004) 608–614.
[60] R. Pieper, Q. Su, C.L. Gatlin, S.T. Huang, N.L. Anderson, S. Steiner, Proteomics 3 (2003) 422–432.
[61] R.S. Tirumalai, K.C. Chan, D.A. Prieto, H.J. Issaq, T.P. Conrads, T.D. Veenstra, Mol. Cell. Proteomics 2 (2003) 1096–1103.
[62] J.N. Adkins, S.M. Varnum, K.J. Auberry, R.J. Moore, N.H. Angell, R.D. Smith, D.L. Springer, J.G. Pounds, Mol. Cell. Proteomics 1 (2002) 947–955.
[63] N.L. Anderson, M. Polanski, R. Pieper, T. Gatlin, R.S. Tirumalai, T.P. Conrads, T.D. Veenstra, J.N. Adkins, J.G. Pounds, R. Fagan, A. Lobley, Mol. Cell. Proteomics 3 (2004) 311–326.
[64] E. Petricoin, L.A. Liotta, Clin. Chem. 49 (2003) 1276–1278.
[65] L.V. Leak, L.A. Liotta, H. Krutzsch, M. Jones, V.A. Fusaroa, S.J. Ross, Y. Zhao, E.F. Petricoin III, Proteomics 4 (2004) 753–765.