Mass Accuracy and Sequence Requirements for Protein Database Searching


Analytical Biochemistry 275, 39 – 46 (1999) Article ID abio.1999.4270, available online at http://www.idealibrary.com on

Mass Accuracy and Sequence Requirements for Protein Database Searching

M. Kirk Green,*,1 Murray V. Johnston,*,2 and Barbara S. Larsen†

*Department of Chemistry and Biochemistry, University of Delaware, Newark, Delaware 19716; and †Central Research and Development, The DuPont Company, P.O. Box 80288, Wilmington, Delaware 19880

Received April 8, 1999

To elucidate the role of high mass accuracy in mass spectrometric peptide mapping and database searching, selected proteins were subjected to tryptic digestion and the resulting mixtures were analyzed by electrospray ionization on a 7 Tesla Fourier transform mass spectrometer with a mass accuracy of 1 ppm. Two extreme cases were examined in detail: equine apomyoglobin, which digested easily and gave very few spurious masses, and bovine α-lactalbumin, which, under the conditions used, gave many spurious masses. The effectiveness of accurate mass measurements in minimizing false protein matches was examined by varying the mass error allowed in the search over a wide range (2–500 ppm). For the “clean” data obtained from apomyoglobin, very few masses were needed to return valid protein matches, and the mass error allowed in the search had little effect up to 500 ppm. However, in the case of α-lactalbumin more mass values were needed, and low mass errors increased the search specificity. Mass errors below 30 ppm were particularly useful in eliminating false protein matches when few mass values were used in the search. Collision-induced dissociation of an unassigned peak in the α-lactalbumin digest provided sufficient data to unambiguously identify the peak as a fragment from α-lactalbumin and eliminate a large number of spurious proteins found in the peptide mass search. The results show that even with a relatively high mass error (0.8 Da for mass differences between singly charged product ions), collision-induced dissociation can help identify proteins in cases where unfavorable digest conditions or modifications render digest peaks unidentifiable by a simple mass mapping search. © 1999 Academic Press

1 Current address: Department of Chemistry, McMaster University, Hamilton, Ontario, Canada L8S 4M1. 2 To whom correspondence should be addressed. Fax: 302-8316335. E-mail: [email protected].

0003-2697/99 $30.00 Copyright © 1999 by Academic Press All rights of reproduction in any form reserved.

In the past few years, mass spectrometry has come to play a critical role in protein identification, particularly in applications such as analysis of 2D gels, because of its capability for accurate mass measurement, its ability to provide detailed information from small amounts of sample, and its potential for high-speed, high-volume processing (1–3). The combination of peptide mass mapping and database searching has proven especially effective. In this approach, a protein is enzymatically digested (typically with trypsin, although a large number of alternate endopeptidases are available) and the masses of the resulting peptides are measured by matrix-assisted laser desorption ionization (MALDI) 3 or electrospray ionization (ESI) mass spectrometry. These masses are then compared with predicted digest masses from a database of protein sequences to identify the protein. A number of software packages for this purpose are readily accessible on the Internet (4), each differing slightly in data requirements, flexibility, and format. In the ideal case, a set of digest masses is submitted to one of these packages, and only one protein is found in the database which matches all of the masses submitted. In practice, not all the masses submitted will be matched, and there may be more than one “good” protein match or none. Another possibility is that the match(es) returned will be incorrect, that is, there will be a protein in the database whose theoretical digest peptides will fortuitously match a number of the submitted masses. If no good matches are found, then either the protein does not exist in the database or the digest was compromised for some reason. 
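The digest-and-compare procedure described above can be sketched in a few lines. This is a minimal illustration, not the algorithm of any of the cited packages: the cleavage rule (after K or R, not before P) is the commonly used trypsin rule, the residue masses are standard monoisotopic values, and the two database entries `toy_A` and `toy_B`, the function names, and the 5 ppm tolerance are all hypothetical.

```python
import re

# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O accounting for the free N- and C-termini
PROTON = 1.00728   # charge carrier of the singly protonated ion MH+


def tryptic_peptides(sequence, missed_cleavages=0):
    """Cleave after K or R unless followed by P (needs Python 3.7+ for
    splitting on an empty match); optionally join up to `missed_cleavages`
    neighbouring pieces."""
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]
    peptides = []
    for start in range(len(pieces)):
        last = min(start + missed_cleavages, len(pieces) - 1)
        for end in range(start, last + 1):
            peptides.append("".join(pieces[start:end + 1]))
    return peptides


def mh_plus(peptide):
    """Monoisotopic mass of the singly protonated peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


def count_matches(measured, sequence, tol_ppm, missed_cleavages=0):
    """Number of measured masses explained by at least one predicted peptide."""
    predicted = [mh_plus(p) for p in tryptic_peptides(sequence, missed_cleavages)]
    return sum(
        any(abs(m - t) / t * 1e6 <= tol_ppm for t in predicted)
        for m in measured
    )


def rank_proteins(measured, database, tol_ppm, missed_cleavages=0):
    """Sort database entries by the number of measured masses they match."""
    scores = [(name, count_matches(measured, seq, tol_ppm, missed_cleavages))
              for name, seq in database.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical toy database; both sequences are invented for illustration.
TOY_DATABASE = {"toy_A": "EQLTKCEVFRGK", "toy_B": "MKVKEDKAAR"}
measured = [mh_plus("EQLTK"), mh_plus("CEVFR")]
ranking = rank_proteins(measured, TOY_DATABASE, tol_ppm=5, missed_cleavages=1)
```

In this toy run `toy_A` matches both submitted masses, while `toy_B` still scores one match, because its missed-cleavage peptide VKEDK has the same elemental composition as EQLTK. That is the kind of fortuitous match the text refers to.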
3 Abbreviations used: ESI, electrospray ionization; FTICR, Fourier transform ion cyclotron resonance; MALDI, matrix-assisted laser desorption ionization; TOF, time-of-flight; MS/MS, mass spectrometry/mass spectrometry; CID, collision-induced dissociation.

In the case of several good matches or a false match, the problem becomes one of increasing the specificity of the search so that false matches are ruled out and only the correct match remains. A number of strategies are available to increase the specificity of the database search. If the information is available, most software packages allow the user to restrict the search based on the taxonomic category and the molecular mass of the parent protein. However, both of these may be undesirable, the former because it rules out matches to homologous proteins across species boundaries (5) and the latter because extensive posttranslational modification may have significantly altered the molecular mass from that of the corresponding amino acid sequence in the database. Other possible restrictions include amino acid composition, more accurate mass measurement of the peptide fragments, and sequence information from chemical sequencing or MS/MS of the digest peptides. A number of papers have explored the benefits of combining peptide mass and sequence information to provide highly reliable results, even in the presence of interferences such as tryptic peptides from other proteins or contamination (6–8). It has been shown that MS/MS information from a single peptide may be sufficient to unambiguously identify a protein (1).

The utility of accurate mass measurement has not been completely examined. Decreasing the mass error allowed in the search has obvious benefits: all mass matches that are close, but not exact, can in principle be ruled out. Recent work by Mann and co-workers (9) showed the utility of delayed extraction MALDI time-of-flight (TOF) mass measurements. In that study, the ability of the instrument to provide highly accurate mass measurement (<30 ppm) allowed the researchers to specify a mass error of 50 ppm in their matches, greatly increasing the specificity of their searches. The drastic decrease in the number of single peptides matching a given mass as the allowed mass error is lowered from 1 to 0.01 Da is well documented (10–12).
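How the allowed mass error prunes near-miss candidates can be illustrated with a short sketch. The measured mass and the candidate masses below are invented for illustration, and the function names are hypothetical; only the definition of a ppm error is standard.

```python
def ppm_error(measured, theoretical):
    """Signed relative mass error in parts per million."""
    return (measured - theoretical) / theoretical * 1e6


def window_da(mass, tol_ppm):
    """Half-width in Da of a +/- tol_ppm window centred on `mass`."""
    return mass * tol_ppm * 1e-6


def candidates_within(measured, candidates, tol_ppm):
    """Candidate masses whose error relative to `measured` lies inside the window."""
    return [c for c in candidates if abs(ppm_error(measured, c)) <= tol_ppm]


# Hypothetical measured mass and candidate peptide masses (Da), invented here.
MEASURED = 1500.0
CANDIDATES = [MEASURED + d for d in (-0.60, -0.05, -0.012, 0.0, 0.02, 0.30)]

# Number of surviving candidates at three allowed mass errors (ppm).
counts = {tol: len(candidates_within(MEASURED, CANDIDATES, tol))
          for tol in (500, 30, 2)}
```

For these made-up candidates, six survive at 500 ppm, three at 30 ppm, and one at 2 ppm. At m/z 1500 a 30 ppm window corresponds to ±0.045 Da, which falls between the 1 Da and 0.01 Da limits quoted above.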
In addition, Cao and Moini have shown that reducing the allowed mass error for tryptic peptides in a database search to 10 ppm can significantly reduce the number of proteins returned (13). However, there is some question as to the utility of even higher mass accuracy, since some “incorrect” peptide mass matches will never be eliminated because of mass redundancy: different peptides may have exactly the same molecular mass. For example, the terminal tryptic sequence of bovine α-lactalbumin, (-)EQLTK, has an exact mass of 618.3463, but so do the tryptic fragments (K)QTLEK, from Staphylococcus aureus gene A, and (K)VKEDK, from Methanococcus jannaschii polyferridoxin. In fact, these two unrelated proteins have two and four tryptic fragments, respectively, which are exact mass matches to α-lactalbumin tryptic fragments.

TOF (9, 12–14) and sector (15) instruments have been used to measure the masses of peptides to 50 ppm or better. Fourier transform ion cyclotron resonance (FTICR) is capable of highly precise mass measurement with accuracies on the order of 1 ppm over a wide mass range (16–18). This allows the possibility of performing database searches with very stringent mass error criteria. In this paper, masses of the tryptic digest peptides from two common proteins, myoglobin and α-lactalbumin, were measured to high accuracy (~1 ppm) and the dependence of search specificity on the allowed mass error and the number of submitted masses was examined. The ability of sequence information from collision-induced dissociation (CID) to supplement accurate mass data was also studied.

MATERIALS AND METHODS

Equine apomyoglobin and bovine α-lactalbumin (Sigma, used as purchased) were dissolved at a concentration of 100 µg/ml in 50 mM NH4OAc adjusted to pH 8. No denaturation or reduction and alkylation procedures were performed. In the case of α-lactalbumin, this approach was expected to result in the production of many spurious peptides (uncleaved disulfide bridges) in the digest that are not recognized in the database searching routine. Thus, the α-lactalbumin digest simulates an “impure” sample while the myoglobin digest represents a “clean” sample. Trypsin (Sigma, TPCK treated) was added to give a 50:1 protein:enzyme ratio. Proteins were digested for 36 h at 37°C, and the digest solution was vacuum dried overnight. The tryptic peptides were reconstituted in 50:50 MeOH/1% HOAc to a concentration of 100 µg/ml for ESI analysis.

The tryptic solutions were infused without separation at a rate of 0.5 µl/min into a Bruker BioApex FTICR operating at 7 Tesla, equipped with an Analytica ESI source. Spectra were obtained by averaging 10 scans of 256k data points. Resolving power was typically 20,000 at m/z 1000, and only peaks with S/N > 5 were used in the searches. Data were processed using XMASS software, and all reported masses are monoisotopic. Spectra were internally calibrated with a mixture of glutathione, angiotensin I, and substance P. Masses of the internal standard peaks were accurate to within 1 ppm. CID of selected peptides on the FTICR was done using sustained off-resonance irradiation (SORI) with an offset of 2700 Hz.

MS-Fit (4a) was chosen as the mass map analysis software because of its ease of use and flexibility, particularly in the number of missed cleavage sites allowed. The protein database searched was the National Center for Biotechnology Information protein database (19), a nonredundant database which is compiled from several others and updated regularly.
Sets of the n most intense peaks, where n was varied from 5 to 60, were tested against the database with the allowed mass error varied from 2 to 500 ppm. The CID data were tested against the same database using a related program, MS-Tag (4a).

FIG. 1. Mass spectra of the (a) myoglobin and (b) α-lactalbumin tryptic digests. (Inset) An expansion of the 812–814 m/z region. Peaks with the same number correspond to different charge states of the same species and are numbered in order of decreasing abundance of the species. C, internal calibrant peak.

RESULTS AND DISCUSSION

The ESI-FTICR spectra of the myoglobin and α-lactalbumin digests are shown in Fig. 1. Peaks labeled with the same number represent different charge states of the same mass species and are numbered in order of total intensity summed over the charge states. For the myoglobin digest, the accuracy of the five mass calibrant peaks (two charge states of angiotensin I and substance P; one charge state of glutathione) between 308 and 1350 m/z was 1.2 ppm, and for α-lactalbumin, the internal consistency of the four peaks between 308 and 1051 m/z was 0.8 ppm. The α-lactalbumin digest spectrum has relatively few peaks compared to the myoglobin spectrum. One reason for this is simply that

α-lactalbumin is a smaller protein (12 kDa as opposed to 17 kDa). Additionally, α-lactalbumin possesses four disulfide bonds, which were not cleaved under the conditions used, leading to fewer fragments. The most abundant species in these two spectra are listed in Table 1.

TABLE 1
The 10 Most Abundant Species Observed in the Tryptic Digest Mass Spectra(a)

Myoglobin (±1.2 ppm)    α-Lactalbumin (±0.8 ppm)
1606.8540               4057.7350
1885.0200               1200.6521
1378.8420               1797.8680
1815.9003               618.3460
748.4345                751.3777
1502.6680               565.2985
631.3404                375.2241
564.3139                488.3079
1158.5785               389.2398
650.3140                1776.8669

(a) Mass values (Da), reported as the m/z of MH+, are listed in descending order of intensity.

Table 2 summarizes the results returned from database searches using the 5 and the 10 most abundant species from the myoglobin digest spectrum. When the 5 most abundant species were submitted with the requirement that mass matches must be within 2 ppm (row 1, columns 1–5), 4 proteins were found in the database which had tryptic fragments matching the masses of all 5 submitted species. An additional 3 proteins were found which had fragments matching 4, but could not match all 5. No proteins were found which matched exactly 3 of the 5 submitted masses, and 25 were found which had tryptic fragments matching exactly 2 of the 5 submitted masses. The four best fits were equine myoglobin, equine myoglobin with an apparent sequence error (D122 listed as N), and two point mutations. All four differ from each other by only one amino acid and could not be differentiated because the sequence coverage (44% in this case) did not extend over the regions containing the different residues. Increasing the mass error allowed up to 500 ppm has no effect on the number of proteins returned (column 2), because the probability of a protein spuriously matching all 5 submitted masses, even with relaxed error tolerance, is low. At low mass error (<50 ppm) the second best rank of proteins (exactly four of five submitted masses matched) contains only three members, all of which are also equine myoglobin point mutations. In this case, increasing the mass error above 30 ppm increases the number of proteins returned, as there are some unrelated proteins in the database which provide spurious matches to four of five of the submitted masses. This tendency for a higher mass error to allow selection of unrelated proteins is also displayed in the lower quality protein fits: at low mass error (below 10 ppm), no proteins are returned which match exactly 3 of 5 submitted masses, but as the mass error increases, the number of proteins returned increases dramatically: 5 at 30 ppm error and 268 at 500 ppm error. For the poorest fits, only 2 of 5 submitted masses matched; 20/25 (80%) of the proteins returned are myoglobins (of other species) at the lowest mass error, 2 ppm. As the mass error is increased, however, the myoglobins in this category are quickly overwhelmed by spurious matches.

The results of a search using the 10 most abundant species from the myoglobin spectrum are also displayed in Table 2. In this case, the best proteins matched only 8 of 10 submitted masses because 2 of the masses submitted corresponded to nonspecific cleavage products and were not recognized. The increase in specificity gained by including more masses in the submitted set shifts only 1 protein out of the best rank, leaving 3, despite the fact that sequence coverage has now increased to 61%. The benefits of the larger number of submitted masses are most apparent with the lower quality matches and at low mass error. For example, even in the worst fits shown (only 4 of 10 masses matched), at 2 ppm mass error, all 8 of the proteins returned are myoglobins and unrelated proteins have been eliminated.

Summarizing the results presented in Table 2, it appears that high mass accuracy is most important for the relatively poor protein fits. The selectivity of the search for the best fits is essentially unaffected by mass accuracy over the range examined. Selecting unmodified equine myoglobin as the unique best fit protein was unusually difficult because of the presence of point mutations in the database. However, when all of the 64 masses from the digest spectrum were submitted in a search, complete sequence coverage was achieved, and unmodified equine myoglobin was finally selected as the unique best-fitting protein. Restricting the search by species of origin, normally an effective strategy when the species is known, was not effective in distinguishing between the best fits, because the point mutations in the database are, in fact, all mutations of equine myoglobin. CID data would not be useful in ruling out closely related proteins unless the tryptic peak analyzed contained the sequence difference in question. One strategy which is effective in this case is restricting the mass of the protein in the search: FTICR has sufficient accuracy to determine the mass of a protein of this size to better than 1 Da. In an earlier study, we were able to measure the mass of the most abundant peak in the mass spectrum of this protein to be 16,951.00 ± 0.02 (20), allowing the average mass of the parent protein to be restricted to 16,951.5 ± 0.5 (21). This degree of specificity allows the unique selection of unmodified

equine myoglobin (MW 16,951.6), ruling out the other top candidates, which have average masses of 16,950.6, 16,977.6, and 16,979.6. Such stringent protein mass restrictions must be used with caution, however: correct protein fits may inadvertently be excluded if their mass has been altered due to modification.

Table 3 summarizes the search results obtained using the 5 and 10 most abundant masses of the α-lactalbumin digest. When the 5 most abundant masses were submitted, because 3 of these 5 did not correspond to tryptic peptides of α-lactalbumin, the best protein fits only matched 2 of the 5 submitted peptides at 2 ppm mass error. The 10 best proteins include 4 α-lactalbumins or related proteins and 6 spurious hits. Increasing the mass error allowed rapidly makes the situation even worse: at a mass error of 50 ppm, almost 1000 proteins are returned, mostly unrelated, which match exactly 2 of 5 submitted masses. At this mass error, there is a single best protein which matches 4 of 5 submitted masses, but this protein is unrelated to α-lactalbumin.

The results of a search with the 10 most abundant α-lactalbumin digest masses are more encouraging. At a mass error of 20 ppm or less, a unique best fit protein is found, and although it only matches 5 of the 10 submitted masses, it is the correct protein. The other 5 submitted masses were not simple tryptic peptides (vide supra). In this case, the target protein was identified, even though the sequence coverage is relatively low, only 20%. This is in contrast to the myoglobin case, in which 61% sequence coverage was insufficient to uniquely identify unmodified equine myoglobin. However, the presence of several highly homologous proteins in the database (a sequence error and point mutations) makes myoglobin a somewhat aberrant case. Mann and co-workers (2) have suggested that a minimum of 15% sequence coverage should be considered acceptable for protein identification. The ability of the search to uniquely identify the correct protein is maintained for larger allowed errors up to 30 ppm, at which point 2 unrelated proteins are selected in the best rank (now 6 of 10, rather than 5 of 10) and bovine α-lactalbumin has been relegated to the second best rank, which it shares with 5 unrelated proteins. Past this point, as in the case of 5 submitted masses, the number of false positives returned increases rapidly with allowed error. Above 50 ppm error, a unique best fit protein is returned which matches 8 out of 10 masses, but this protein is unrelated to α-lactalbumin.

For the α-lactalbumin digest, the difficulty in identifying the correct protein can be ascribed to the large number of nontryptic masses submitted, which limits the number of matches to be expected from a related protein and also increases the probability of a random protein matching a significant number of the submitted masses. It appears that for the α-lactalbumin digest, with a large number of nontryptic peaks, achieving a low mass error is much more important than for the relatively clean myoglobin digest. In the case of α-lactalbumin, nonmatching peaks were intentionally introduced to study their effect on a database search by omitting the reduction and alkylation of the disulfide bonds. In practical applications, nonmatching peaks could be (and often are) present due to

other factors, such as posttranslational modifications of proteins or contamination.

It should be noted that while submitting more masses in a search generally increases the specificity of the search, this is not always true. For the α-lactalbumin digest in this work, increasing the number of submitted masses to 20 actually decreased the specificity of the search at mass errors above 2 ppm (not shown). This loss of specificity occurred because in the α-lactalbumin digest, only 5 of the 20 most abundant species in the digest mass spectrum were legitimate tryptic peptides. Thus, the effect of increasing the number of submitted masses beyond 10 was to add more and more “noise,” in the form of nontryptic peaks, to the submitted set. As a result, more unrelated peptides were selected into the top rank, even for errors as low as 5 ppm.

A question may arise as to what reliability can be expected for a given mass error and number of peptide masses matched. Reliability should generally be greater for proteins returned with a lower mass error and a higher number of peptide masses matched. However, submitted masses which are in fact spurious and do not match will have a deleterious effect on the outcome. Thus, the reliability also has a strong dependence on the fraction of the peptide masses submitted which are actually matched. Based on this reasoning and examination of Tables 2 and 3, it may be useful to define a “reliability index” (RI) in the following manner: RI = NF/log E, where RI is the reliability index, N is the number of peptide masses matched, F is the fraction of the peptide masses submitted which are actually matched, and E is the mass error in ppm. For the mass errors and number of species submitted/matched in Tables 2 and 3, the corresponding values of RI are between 0.3 and 33. As a rule of thumb, the data in these tables suggest that RI values much larger than 3 appear to give good protein fits, while RI values much lower than 3 appear to give unreliable results. The dashed line in Tables 2 and 3 divides cells with RI values above and below 3.5 to qualitatively delineate reliable and unreliable results.

Failure to cleave disulfide bonds and posttranslational modification may be thought of as “disguises” for tryptic peptides: they may be present, but because of a shift in mass, they are invisible to the database searching routine. The same problem is encountered if some residues are inadvertently modified chemically during isolation/purification. Furthermore, it might be desirable in some cases to find related proteins with partial homology; too high a frequency of amino acid substitutions could make detection of these proteins impossible, even if homology is still high. In all of these cases, CID data could provide sufficient sequence-dependent information to help identify a protein despite mass shifts of the tryptic peptides.

FIG. 2. Collision-induced dissociation spectra of the MW 4057 peptide (m/z = 812.3528). Numbers indicate the charge state of the peak. P, parent ion.

As a test of the utility of CID, the most abundant unmatched species in the α-lactalbumin digest (m/z of MH+ = 4057.7350) was selected for CID analysis. The CID spectrum of the +5 charge state of this peptide (m/z 812.3528) is illustrated in Fig. 2. The spectrum is dominated by two series of peaks, which, because of the high resolution of FTICR, can be determined by simple inspection to be quadruply and triply charged ions. The observed peaks are summarized in Table 4. Since the mass of the parent ion was known to be “in error” (not corresponding to any tryptic fragment), the observed fragment masses were not used directly. Instead, the calculated mass differences between these peaks and

TABLE 4
Ions Observed in the SORI CID of the α-Lactalbumin m/z 812.3528 Peak(a)

+4 Peaks    +3 Peaks     +1 Peaks     Calc. series(b)   Amino acid(c)
978.3502    -            -            148.0992          F/(K/Q)H3O+
950.0781    1266.4157    261.1420     261.2166          I/L
921.3262    1228.4050    -            376.1952          D
892.5766    1189.7449    -            491.2258          D
863.8235    1151.4019    606.2110     606.2465          D
835.5559    1113.7163    719.2184     719.3101          I/L
-           1080.0360    -            820.3847          T
781.5405    1041.7137    -            935.3448          D
752.7872    1003.3636    1050.3388    1050.3770         D
-           965.6750     -            1163.4680         I/L

(a) Charge state of this peak is +5; mass value based upon MH+ is 4057.7350.
(b) Mass values, listed as mass differences of singly charged ions, were calculated from the m/z differences of multiply charged product ions and the parent ion.
(c) Tentative identification of amino acid based on observed mass difference.
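The arithmetic behind footnotes a to c of Table 4 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: the proton mass and residue masses are standard monoisotopic values, while the function names and the 0.1 Da assignment tolerance are hypothetical choices made here. The input values are the first six +4 peaks as listed in the table.

```python
PROTON = 1.00728  # Da

# Monoisotopic residue masses (Da); I and L are isobaric and share one label.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "I/L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def mh_plus_from_mz(mz, z):
    """Singly protonated mass MH+ from an m/z value observed at charge z."""
    return mz * z - (z - 1) * PROTON


def singly_charged_differences(mz_values, z):
    """Mass differences (Da) between adjacent product ions that all carry
    charge z; the charge carriers cancel, so each difference is z * delta(m/z)."""
    return [(hi - lo) * z for hi, lo in zip(mz_values, mz_values[1:])]


def assign_residue(delta, tol=0.1):
    """Residue labels whose mass lies within `tol` Da of the mass difference
    (tol is a hypothetical value chosen for this sketch)."""
    return [aa for aa, mass in RESIDUE_MASS.items() if abs(delta - mass) <= tol]


# First six +4 product ions from Table 4, in the order listed.
PLUS4 = [978.3502, 950.0781, 921.3262, 892.5766, 863.8235, 835.5559]
deltas = singly_charged_differences(PLUS4, z=4)
calls = [assign_residue(d) for d in deltas]
```

With these inputs, `mh_plus_from_mz(812.3528, 5)` reproduces the MH+ value of footnote a to within about 0.0002 Da, and the five differences are assigned I/L, D, D, D, I/L, in line with the amino acid column.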

TABLE 5
Results of Protein Database Searches Performed Using a 10-Member Singly Charged Ion Series Derived from the FTICR CID Data of the α-Lactalbumin m/z 812.3528 Peak

Number of proteins returned from the database containing a tryptic peptide whose CID fragments matched the calc. series in Table 4

Mass error (Da)   10/10 ions   Only 9/10 ions   Only 8/10 ions   Only 7/10 ions
0.1               38(a)        10               25               270
0.2               38(a)        15               74               729
0.3               38(a)        15               82               778
0.5               38(a)        15               82               780
0.8               38(a)        15               92               >1000
1.0               87           >1000            >1000            -
1.5               615          -                -                -
2.0               >1000        -                -                -

(a) All 38 proteins returned were α-lactalbumins or related species.

the parent ion were used to construct a 10-member singly charged ion series for submission to MS-Tag. Table 5 summarizes the results returned from the database search using all 10 ion masses. Ion types allowed were a, b, b + H2O, y, and internal fragments, and a mass error of 3 kDa was allowed for the parent ion, to allow for matches to a wide range of possible tryptic fragments. The mass error used in the search was varied from 0.1 to 2.0 Da. When a mass error of 0.1 Da was specified (row 1 of Table 5), 38 proteins were found in the database which had a tryptic fragment whose predicted CID fragment ions matched all 10 masses submitted. A further 10 proteins were found which had a tryptic fragment matching only 9 (the first 9 in all cases) of the 10 masses submitted. Twenty-five proteins were found which matched exactly 8 of the submitted masses, and 270 were found which matched exactly 7 of the masses submitted.

The 38 best fits were all α-lactalbumins or biological precursors, and the matches identified the submitted masses as the y-ion series, FLDDDLTDDI. In bovine α-lactalbumin, this series corresponds to the tryptic fragment F80-K93, FLDDDLTDDI(MCVK). Subsequent inspection of the α-lactalbumin sequence revealed that MH+ = 4057.7350 corresponds to the fragment I59-K93, with two intact disulfide bridges and a cleavage at K79 (theoretical mass = 4057.7386). Subsequently, CID of the other large unidentified peaks revealed that they were also tryptic fragments with intact disulfide bridges. This illustrates the ability of CID to yield critical identifying information from a single peptide, and, furthermore, the ability of CID to provide this information from a peptide which was useless for a tryptic mass map search because it did not conform to the expected pattern.
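Counting how many submitted ion masses a candidate sequence explains, as tabulated in Table 5, can be sketched as below. This is not the MS-Tag algorithm. The ladder is modeled as cumulative residue masses plus one proton, an assumption chosen because it reproduces the calc. series of Table 4 numerically; it says nothing about which of the allowed ion types the search actually assigned. The decoy sequence FLDDDLTNDI and the function names are hypothetical.

```python
PROTON = 1.00728  # Da

# Monoisotopic residue masses (Da) for the residues used in this sketch.
RESIDUE_MASS = {
    "F": 147.06841, "L": 113.08406, "I": 113.08406,
    "D": 115.02694, "T": 101.04768, "N": 114.04293,
}

# Calc. series from Table 4 (mass differences of singly charged ions, Da).
CALC_SERIES = [148.0992, 261.2166, 376.1952, 491.2258, 606.2465,
               719.3101, 820.3847, 935.3448, 1050.3770, 1163.4680]


def prefix_ladder(sequence):
    """Cumulative residue masses plus one proton for each prefix of the
    sequence (assumed ion model, see the lead-in)."""
    total = PROTON
    ladder = []
    for aa in sequence:
        total += RESIDUE_MASS[aa]
        ladder.append(total)
    return ladder


def ions_matched(submitted, sequence, tol_da):
    """Number of submitted masses within tol_da of any predicted ladder ion."""
    ladder = prefix_ladder(sequence)
    return sum(any(abs(m - p) <= tol_da for p in ladder) for m in submitted)


target = "FLDDDLTDDI"   # series reported in the text
decoy = "FLDDDLTNDI"    # hypothetical sequence with one D replaced by N

summary = {
    tol: (ions_matched(CALC_SERIES, target, tol),
          ions_matched(CALC_SERIES, decoy, tol))
    for tol in (0.1, 0.8, 1.0)
}
```

Under this model the target explains all 10 masses at every tolerance, whereas the decoy explains 7 at 0.1 and 0.8 Da but all 10 at 1.0 Da, because D and N differ by only 0.984 Da. A substitution of this size therefore becomes indistinguishable once the allowed error reaches 1 Da.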

Increasing the allowed mass error up to 0.8 Da had no effect on the number of best fits obtained (column 1 of Table 5). Beyond this point (1 Da and higher) the number of best matches returned rose sharply, and many of the sets of ions identified contained ions from two or more different series: the larger mass errors allowed fragments containing different amino acids to match the masses submitted. It should be noted, however, that the 0.8-Da limit indicated in Table 5 represents the highest tolerance for error in a mass difference measurement of a singly charged ion series. If the product ions used to deduce the series are multiply charged, then the tolerance for error in a mass difference measurement becomes 0.8 Da/electronic charge (e.g., 0.2 m/z for quadruply charged ions).

Although CID of the m/z 812.3528 ion identified it as a sequence from a protein related to α-lactalbumin, the α-lactalbumin was not uniquely identified. Further information, either in the form of the digest masses, or as CID data from other tryptic fragments, would be required to obtain a unique match. In the case examined here, the α-lactalbumins found in the tryptic mass search are a subset of those found in the CID fragment search. However, the CID data are an effective supplement to the digest mass data, since the CID information that the protein is related to α-lactalbumin allows the elimination of unrelated proteins from the tryptic database search results. Thus, the results of the search using 5 masses (Table 3) can be restricted to the four protein hits which are α-lactalbumins, and in the searches using 10 masses, bovine α-lactalbumin can be uniquely identified even at the highest mass error allowed.

CONCLUSIONS

Highly accurate (2 ppm) mass measurements for tryptic peptides are not critical for database searches using peptide masses from “well-behaved” digests, in which most or all of the observed species in the mass spectrum are in fact tryptic peptides of a single protein. However, as the population of extraneous peaks in the digest mass spectrum increases, the probability of false matches being returned from a database increases. Under these conditions, high mass accuracy can provide the necessary discrimination to unambiguously identify a protein, even in the presence of considerable chemical noise. CID of digest peaks even with relatively high mass errors (≤0.8 Da for the mass difference between singly charged product ions) can prove useful for identifying proteins, particularly in cases where unfavorable digest conditions or modifications have made the digest peaks unidentifiable by a simple mass mapping search. Information obtained from CID complements mass mapping information by eliminating many of the spurious protein fits. Performing CID with FTICR is advantageous since the charge states of multiply charged product ions are readily determined and the mass differences between product ions can be measured with sufficient accuracy for effective database searching.

ACKNOWLEDGMENTS

This work was supported by a grant from the National Science Foundation (M.V.J., Grant CHE-9629672), a Grant Opportunities for Academic Liaison with Industry (GOALI) supplement from the National Science Foundation (M.V.J. and B.S.L., Grant CHE-9300644), the University of Delaware, and the DuPont Company.

REFERENCES

1. Yates, J. R. (1998) J. Mass Spectrom. 33, 1–19.
2. Jensen, O. N., Podtelejnikov, A. V., and Mann, M. (1997) Anal. Chem. 69, 4741–4750.
3. Humphery-Smith, I., Cordwell, S. J., and Blackstock, W. P. (1997) Electrophoresis 18, 1217–1242.
4. (a) MS-Fit, MS-Tag: http://prospector.ucsf.edu (version 3.0 was used in this work). (b) PeptideSearch: http://www.mann.embl-heidelberg.de. (c) ProFound: http://prowl.rockefeller.edu. (d) MOWSE: http://www.seqnet.dl.ac.uk. (e) MultiIdent: http://expasy.hcuge.ch/tools/multiident.html. (f) Peptide Mass Search: http://www.mdc-berlin.de/~emu/. (g) MassSearch: http://cbrg.inf.ethz.ch.
5. Cordwell, S. J., and Humphery-Smith, I. (1997) Electrophoresis 18(8), 1410–1417.
6. Neubauer, G., Gottschalk, A., Fabrizio, P., Seraphin, B., Luhrmann, R., and Mann, M. (1997) Proc. Natl. Acad. Sci. USA 94, 385–390.
7. Clauser, K. R., Hall, S. C., Smith, D. M., Webb, J. W., Andrews, L. E., Tran, H. M., Epstein, L. B., and Burlingame, A. L. (1995) Proc. Natl. Acad. Sci. USA 92, 5072–5076.
8. Patterson, S. D., Thomas, D., and Bradshaw, R. A. (1996) Electrophoresis 17, 877–891.
9. Jensen, O. N., Podtelejnikov, A., and Mann, M. (1996) Rapid Commun. Mass Spectrom. 10, 1371–1378.
10. Yates, J. R. (1998) Electrophoresis 19, 893–900.
11. Fenyo, D., Qin, J., and Chait, B. T. (1998) Electrophoresis 19, 998–1005.
12. Takach, E. J., Hines, W. M., Patterson, D. H., Juhaz, P., Falick, A. M., Vestal, M. L., and Martin, S. A. (1997) J. Protein Chem. 16, 363–369.
13. Cao, P., and Moini, M. (1998) Rapid Commun. Mass Spectrom. 12, 864–870.
14. Russell, D. H., and Edmondson, R. D. (1997) J. Mass Spectrom. 32, 263–276.
15. McEwen, C. N., and Larsen, B. S. (1992) Rapid Commun. Mass Spectrom. 6, 173–178.
16. Dienes, T., Pastor, S. J., Schurch, S., Scott, J. R., Yao, J., Cui, S. L., and Wilkins, C. L. (1996) Mass Spectrom. Rev. 15, 163–211.
17. Amster, I. J. (1996) J. Mass Spectrom. 31, 1325–1337.
18. Winger, B. E., and Campana, J. E. (1996) Rapid Commun. Mass Spectrom. 10, 1811–1813.
19. National Center for Biotechnology Information (1997) Entrez: http://www.ncbi.nlm.nih.gov/Entrez/.
20. Green, M. K., Vestling, M. M., Johnston, M. V., and Larsen, B. S. (1998) Anal. Biochem. 260, 204–211.
21. Zubarev, R. A., Demirev, P. A., Hakansson, P., and Sundqvist, B. U. R. (1995) Anal. Chem. 67, 3793–3798.
