Bioinformatics support for high-throughput proteomics

Journal of Biotechnology 106 (2003) 147–156. doi:10.1016/j.jbiotec.2003.08.009

Andreas Wilke a,∗, Christian Rückert a,b, Daniela Bartels a, Michael Dondrup a, Alexander Goesmann a, Andrea T. Hüser c, Sebastian Kespohl a, Burkhard Linke a, Martina Mahne c, Alice McHardy a, Alfred Pühler c, Folker Meyer a

a Center for Genome Research, Bielefeld University, Bielefeld D-33594, Germany
b International Graduate School in Bioinformatics & Genome Research, Bielefeld University, Bielefeld D-33594, Germany
c Lehrstuhl für Genetik, Department of Biology, Bielefeld University, Bielefeld D-33594, Germany

∗ Corresponding author. Fax: +49-521-1065626. E-mail addresses: [email protected] (A. Wilke), [email protected] (F. Meyer).

Received 29 April 2003; received in revised form 12 August 2003; accepted 18 August 2003

Abstract

In the “post-genome” era, mass spectrometry (MS) has become an important method for the analysis of proteome data. The rapid advancement of this technique in combination with other methods used in proteomics results in an increasing number of high-throughput projects. This leads to an increasing amount of data that needs to be archived and analyzed. To cope with the need for automated data conversion, storage, and analysis in the field of proteomics, the open source system ProDB was developed. The system handles data conversion from different mass spectrometer software, automates data analysis, and allows the annotation of MS spectra (e.g. assigning gene names, storing data on protein modifications). The system is based on an extensible relational database to store the mass spectra together with the experimental setup. It also provides a graphical user interface (GUI) for managing the experimental steps which led to the MS data. Furthermore, it allows the integration of genome and proteome data. Data from an ongoing experiment were used to compare manual and automated analysis. First tests showed that the automation resulted in a significant saving of time. Furthermore, the quality and interpretability of the results were improved in all cases.

© 2003 Elsevier B.V. All rights reserved.

Keywords: Proteomics; Automated analysis; Platform; Archival of data

1. Introduction

With the advent of the first completed genomes (e.g. Haemophilus influenzae; Fleischmann et al., 1995), the focus of sequence analysis changed from single genes to the whole genome. A similar development can be witnessed in the field of protein analysis. Instead of studying a single protein in detail, the analysis of all proteins of a cell (the proteome) is becoming more and more important. The proteome comprises all the proteins present in an organism, tissue, or cell at a particular time. In contrast to the genome, the proteome is not static but highly dynamic. The most common techniques used in proteomics today are two-dimensional sodium dodecyl sulfate polyacrylamide gel electrophoresis (2D SDS-PAGE) for protein separation and mass spectrometry (MS), which is utilized for protein identification.

Fig. 1. A “typical” proteome experiment. Different steps of a proteome experiment are shown and divided into several stages where bioinformatics is needed to store, process, and/or analyze data. This starts with the preparation of the biological material where parameters like media composition, temperature or stress conditions need to be documented. The same holds for the extraction, separation and digestion of the protein samples. Additionally, gel images/chromatograms, machine parameters, and mass spectra need to be archived during these steps. Once the mass spectra have been obtained, they need to be analyzed with different software tools (e.g. Mascot). Storage of the results together with the experimental setup and all other data allows the user to better interpret, compare, and reproduce these results if needed.

The improvement of these techniques has led to large-scale research in proteomics, making it possible to identify almost all proteins of a given proteome (Anderson et al., 2000; Chalmers and Gaskell, 2000). A “typical” proteome experiment can be subdivided into several steps as shown in Fig. 1. First there is the isolation of proteins from biological material, e.g. a cell culture which has been grown under certain conditions. After harvesting the cells/tissue/supernatant, the sample is usually processed (e.g. washed and pulped in the case of cells, or concentrated in the case of a supernatant), yielding a protein mixture. The next step is the separation of this protein mixture, which is achieved by fractionation of the proteins according to their physical properties. The most common techniques currently applied are two-dimensional gel electrophoresis (Fig. 1) and high performance liquid chromatography (HPLC). Following the protein separation, fractions of interest (e.g. spots from a 2D gel) are analyzed by mass spectrometry. For this, the proteins are digested into peptides. The peptides are ionized (e.g. by matrix-assisted laser desorption/ionization (MALDI) or electrospray ionization (ESI)) and their masses are determined by time-of-flight (TOF) analysis or in an ion trap, yielding one or more mass spectra. The final step is the analysis of the obtained mass spectra. After processing the mass spectra (mass deconvolution, conversion of the data format, etc.), the data are analyzed with various bioinformatics tools to answer the question(s) of interest (e.g. identification of the protein(s) in the fraction, search for protein modifications).

Depending on the mass spectrometry experiment, the data is of varying informational content and can therefore be used for different purposes. In the simplest case, the peptide mass fingerprint (PMF) is used to identify proteins in a database. For this purpose, the protein is broken down into characteristic components: the protein is digested with a specific enzyme into peptides and the mass of each peptide is determined. The combination of these masses is characteristic for the protein, and the list of the peptide masses is called the peptide mass fingerprint. This experimental PMF is compared in silico with all theoretical PMFs in a database, usually representing all proteins in a given genome. Data from tandem mass spectrometry (MS/MS) is used to enhance the precision of protein identification. This is done by measuring the mass of several peptides which are in turn fragmented one by one, followed by the determination of the fragment masses. If the genome of the analyzed organism is unknown, de novo sequencing with tandem mass spectrometry can be used to determine the protein sequence of short peptides, up to 12 amino acids in length (Dančík et al., 1999; Zhang and McElvain, 2000; Chen et al., 2001; Taylor and Johnson, 2001).

With the increasing amount of data in high-throughput proteomics, it becomes more and more difficult to archive and analyze this data manually, so there is a need for a system supporting the user in data handling. For the steps represented in Fig. 1, various data sets need to be stored and analyzed using bioinformatics methods. Not only the mass spectra themselves, but also the experimental setup and various parameters have to be archived. Some of these data sets, such as a detailed description of the growth conditions or the parameters of the data analysis, will gain importance only later, when multiple proteomes are compared. Others, like the spot–spectrum relation, are important in all scenarios. To gain information about the function and the biological context of a given protein, it is necessary to combine proteome data with genome and transcriptome data. Furthermore, it should be possible to mine the proteome data for hidden correlations. As experiments and data formats are changing continuously, it is necessary that the system can be easily adapted to new demands.

While there are various tools for solving the subproblems mentioned above (Chakravati et al., 2002; Fenyö, 2000), e.g. Mascot (Perkins et al., 1999), SEQUEST (Eng et al., 1994) or ProFound (Zhang and Chait, 2000) to identify proteins in protein databases using mass spectrometric data, no freely available tool for the support of the whole experiment exists. Besides this, there are some commercial efforts towards a proteomics platform, e.g. ProteinScape (Blüggel et al., 2002), but this software is hard to evaluate because it is not freely available. The one aspect present in all such systems is the ability to navigate within databases via 2D gel images. Therefore, a platform was built which integrates different tools for analysis and ensures the archival of MS data together with the experimental setup (e.g. growth conditions, experimental protocols, machine parameters), measured data (e.g. 2D gels/chromatograms, mass spectra), and results of the analysis (e.g. identified proteins) in a database.
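To make the PMF idea described above more concrete, the following Perl sketch performs a simple in silico tryptic digest and prints the resulting theoretical peptide masses. It is only an illustration and not part of ProDB: the protein sequence is an arbitrary example, the residue masses are rounded monoisotopic values, and missed cleavages and modifications are ignored.

```perl
#!/usr/bin/perl
# Illustrative sketch: theoretical peptide mass fingerprint (PMF)
# via in silico tryptic digestion of a protein sequence.
use strict;
use warnings;

# Rounded monoisotopic residue masses (Da).
my %mass = (
    G => 57.02146,  A => 71.03711,  S => 87.03203,  P => 97.05276,
    V => 99.06841,  T => 101.04768, C => 103.00919, L => 113.08406,
    I => 113.08406, N => 114.04293, D => 115.02694, Q => 128.05858,
    K => 128.09496, E => 129.04259, M => 131.04049, H => 137.05891,
    F => 147.06841, R => 156.10111, Y => 163.06333, W => 186.07931,
);
my $water = 18.01056;    # one H2O is added per peptide

# Arbitrary example sequence (not a real C. glutamicum protein).
my $protein = 'MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEK';

# Trypsin cleaves C-terminal to K or R, but not before P.
my @peptides = split /(?<=[KR])(?!P)/, $protein;

for my $pep (@peptides) {
    my $m = $water;
    $m += $mass{$_} for split //, $pep;
    printf "%-25s %10.4f\n", $pep, $m;    # peptide and its theoretical mass
}
```

A list like this, computed for every predicted protein of a genome, is what an experimental PMF is matched against.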

2. Platform

We have implemented ProDB as a platform for the evaluation and archival of proteome experiments, using Perl (http://www.perl.com), a widely used programming language in bioinformatics. To ensure compatibility with tomorrow's tasks, the system is based on a modular design (Fig. 2). The implementation of the system and the design of the data schema are object-oriented. As the database we use MySQL (http://www.mysql.com), a relational database management system. To benefit from both object-oriented development and a relational database, we apply O2DBI (Linke, 2002). The O2DBI program package provides a persistence layer between a database and a program. This allows the programmer to use Perl objects without worrying about how data in the objects is stored in or retrieved from the database. Based on the capability of O2DBI to map a relational database schema to Perl objects, a data schema was developed to represent all “necessary information”. The information consists of all data concerning a proteome experiment, from the experimental setup to mass spectrometry to the final analysis of MS spectra. This is required since the results of proteome experiments tend to be strongly dependent on all of these factors, like culturing conditions, sample preparation, and machine parameters.
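The following Perl fragment only illustrates the general idea of such a persistence layer. The class, table, column, and method names are invented for this sketch and are not the actual O2DBI API, and the connection parameters are placeholders.

```perl
# Hypothetical illustration of an object-relational persistence layer.
# In ProDB, O2DBI generates this kind of mapping automatically; the code
# below merely shows the SQL that such a layer hides behind object methods.
use strict;
use warnings;
use DBI;

package Sample;                        # one class corresponds to one table
sub new    { my ($class, %attr) = @_; return bless { %attr }, $class; }
sub name   { return $_[0]->{name}; }
sub gel_id { return $_[0]->{gel_id}; }

package main;

# Placeholder connection parameters.
my $dbh = DBI->connect('dbi:mysql:database=prodb', 'user', 'password',
                       { RaiseError => 1 });

my $sample = Sample->new(name => 'spot_42', gel_id => 7);
$dbh->do('INSERT INTO Sample (name, gel_id) VALUES (?, ?)',
         undef, $sample->name, $sample->gel_id);
```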

Fig. 2. Overview of the ProDB platform. The diagram shows the modular design of the system. Based on a central management layer, different modules handle data conversion and communication with external software. Parsers (triangles) for MALDI and ESI mass spectra data as well as for image data from PDQuest translate the different data formats into an internal representation. Other modules (squares) handle the bidirectional data flow between the system and search engines or other databases (GenDB). All data gathered from the various sources is stored in a relational database using O2DBI as the database interface. Data manipulation as well as visualization of results is realized via a graphical user interface (GUI).

The functionality provided by O2DBI is then used to build a “management layer”, which provides basic functionality by controlling the information flow between all other modules.

To enter data into the system, three modules (parsers) were developed which convert data from external programs so that it can be stored in the ProDB system (triangles in Fig. 2). Supporting the most commonly employed technique for protein separation, 2D SDS-PAGE, the information generated by image analysis software (e.g. spot coordinates, spot intensities) is stored in the ProDB database. Currently supported is PDQuest (http://www.bio-rad.com), since this software is used in our laboratory. It allows the detection of spots and the matching of multiple 2D gels. Also provided are export functions for spot positions and spot intensities, for both single gels and whole “match sets”. To allow linking of spots with MS spectra and results from database searches, this data is in turn parsed and stored in the ProDB database. Data import from other imaging software, e.g. Z3, is not yet supported, but can be realized easily by implementing a new parser, which has to convert the exported data from the imaging software into the internal data representation.

For the purpose of MS spectra acquisition, different types of mass spectrometers exist, which use different methods for ionization and mass determination. As the machines from diverse suppliers use different internal data formats, we have implemented parsers which convert the data from the different mass spectrometers into an internal data representation (a minimal sketch of this idea is given below). This representation is designed to encompass all necessary information required by the currently available search engines.

In order to analyze mass spectra which have been stored in ProDB, interaction with other types of external programs (e.g. database search engines) becomes necessary. This functionality is provided via the ProDB Tools module, which allows the user to submit the MS data to three search engines, namely Mascot (http://www.matrixscience.com), SEQUEST (http://fields.scripps.edu/sequest/; http://www.thermo.com), and emowse (Pappin et al., 1993).
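As a minimal sketch of the parser idea described above, the following Perl code reads a generic two-column peak list (m/z and intensity per line) into a simple spectrum structure. The input format and the internal representation shown here are simplified assumptions; the actual vendor formats and the ProDB-internal representation differ.

```perl
# Sketch: convert a generic "m/z <whitespace> intensity" peak list into a
# simple in-memory spectrum structure (illustration only, not ProDB code).
use strict;
use warnings;

sub parse_peaklist {
    my ($file) = @_;
    open my $fh, '<', $file or die "cannot open $file: $!";
    my @peaks;
    while (my $line = <$fh>) {
        next if $line =~ /^\s*(?:#|$)/;      # skip comments and blank lines
        $line =~ s/^\s+//;
        my ($mz, $intensity) = split /\s+/, $line;
        next unless defined $intensity;      # ignore malformed lines
        push @peaks, { mz => $mz + 0, intensity => $intensity + 0 };
    }
    close $fh;
    return { source_file => $file, peaks => \@peaks };   # one spectrum per file
}

my $spectrum = parse_peaklist($ARGV[0] // 'example_spectrum.txt');
printf "read %d peaks from %s\n",
    scalar @{ $spectrum->{peaks} }, $spectrum->{source_file};
```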

Fig. 3. The ProDB data import form. In order to start an experiment, the user needs to give some basic information. First of all, the experiment gets a unique identifier (experiment ID). Furthermore, the user has to define the species and strains to be examined and the responsible experimenter, normally the user himself. In addition, a comment field allows the user to describe, e.g., the general purpose of the experiment. After that, the user has access to other forms for defining experimental steps, e.g. cultivation or protein isolation.

Although web-based data submission using the HTTP protocol is considerably slower than the direct file-based approach, we implemented the former as it allows us to access not only local search engines but all freely accessible web-based search engines and their databases (a minimal sketch of such a submission is given after the list below). This approach enhances portability, as the system can also be applied by users with no access to locally installed search engines.

• Mascot integrates various approaches for protein identification: PMF, mass spectra in combination with sequence data, and data from tandem mass spectrometry. It uses a probability-based scoring algorithm for the identification (Creasy and Cottrell, 2002; Perkins et al., 1999).

• Additionally, we use SEQUEST for tandem mass spectrometry data. Its scoring is based on a cross-correlation function calculated between the experimental spectrum and the protein sequence in the database (Fenyö, 2000).

• Emowse is part of the EMBOSS package (http://www.hgmp.mrc.ac.uk/Software/EMBOSS/index.html) (Rice et al., 2000) and is based on the algorithm used in MOWSE (Pappin et al., 1993). In contrast to Mascot, SEQUEST, and ProFound, emowse is freely available. There are free web interfaces for the evaluation with Mascot and ProFound, but these are restricted to certain general databases.

ProDB Tools implements a batch mode to analyze multiple mass spectra with various sets of parameters. The results from the search engines are in turn parsed for the relevant information, which is then stored together with the mass spectra and search parameters in the ProDB database.
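The following Perl fragment sketches what such an HTTP submission could look like using the standard LWP::UserAgent module. The URL, form-field names, and parameter values are placeholders invented for this illustration; they are not the real CGI interface of Mascot or of any other search engine.

```perl
# Sketch: submit a peak list to a web-based search engine over HTTP.
# URL and form fields are placeholders, not a real search-engine interface.
use strict;
use warnings;
use LWP::UserAgent;

my $ua  = LWP::UserAgent->new(timeout => 300);
my $url = 'http://search-engine.example.org/cgi/search.pl';

my $response = $ua->post(
    $url,
    Content_Type => 'form-data',
    Content      => [
        database  => 'C_glutamicum_CDS',      # hypothetical database name
        enzyme    => 'Trypsin',
        tolerance => '0.2',                   # peptide tolerance (Da)
        # provide the peak list as an in-memory "file upload"
        peaklist  => [ undef, 'spot_42.txt',
                       Content => "842.5095\n1045.5642\n2211.1040\n" ],
    ],
);

die 'search request failed: ' . $response->status_line
    unless $response->is_success;

print $response->decoded_content;   # raw result page, parsed downstream
```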

Fig. 4. Visualization of results. The figure shows an example of one of the different result presentations. In this case, all samples from a certain experiment were selected for visualization. As a second selection criterion, the user chose to display only results from searches with the Mascot search engine. The visualization is then done by producing HTML output, in this case a table listing all samples together with the best hit found using different parameter sets. To facilitate the evaluation of the results, information concerning the identified protein (proposed function, molecular weight and pI) as well as the parameters delivering the hit is listed.

To facilitate data import and analysis, a graphical user interface (GUI) was developed. The GUI guides the user through all steps of the experiment to enter the necessary data (gel images, mass spectra, search engine results, etc.; Fig. 3), which ensures a complete documentation of the information. Once all necessary data has been stored in the system, the user can select data sets for visualization. Possible data sets include all spots from one or more gels, all samples belonging to a certain cultivation approach, or all samples treated according to a certain experimental protocol (e.g. digestion with chymotrypsin). Search results can be represented as HTML tables as shown in Fig. 4 or linked to a browsable picture of the corresponding 2D gel. If no 2D gel is present (e.g. in the case of samples derived by HPLC fractionation), the results can also be linked to a virtual gel. In case of conflicting results based on different search engines or parameter sets, these conflicts are highlighted in the visualization. This allows the user to identify and resolve these conflicts.

To combine proteome data with genome data, we have implemented an interface to GenDB (Meyer et al., 2003), an open source genome annotation system, via the BRIDGE system (Goesmann et al., 2003). This link provides additional data from the annotation (e.g. protein function, presence of signal peptides and/or transmembrane helices, etc.) and allows the user to update the gene annotation in GenDB. This update can effect a change from the predicted/hypothetical state to experimentally verified/found, based on results from ProDB Tools. In addition to the connection to GenDB, BRIDGE provides a link to the EMMA system (Dondrup et al., 2003), which is used for the evaluation of microarray experiments.

The ProDB system is currently in the beta testing phase in our department. Once a stable version is reached, we will provide more information at http://www.cebitec.uni-bielefeld.de/software/prodb.
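As a simple sketch of how stored search results can be rendered as an HTML table, the following Perl code prints one row per sample. The column set and the two result records are invented for illustration; the actual ProDB output (Fig. 4) contains more information.

```perl
# Sketch: render search results as a simple HTML table (illustration only;
# the records below are invented and not real search results).
use strict;
use warnings;

my @results = (
    { sample => 'spot_12', hit => 'protein A', score => 87,  mw => 45.1, pi => 4.9 },
    { sample => 'spot_37', hit => 'protein B', score => 112, mw => 51.0, pi => 4.6 },
);

print qq{<table border="1">\n};
print "<tr><th>Sample</th><th>Best hit</th><th>Score</th>",
      "<th>MW (kDa)</th><th>pI</th></tr>\n";
for my $r (@results) {
    printf "<tr><td>%s</td><td>%s</td><td>%d</td><td>%.1f</td><td>%.1f</td></tr>\n",
        @{$r}{qw(sample hit score mw pi)};
}
print "</table>\n";
```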

Fig. 5. The depicted ProDB data schema (partial view) is the conceptual model of the cultivation in the form of a UML class diagram. Different classes (rectangles) with their associations (lines) are shown. A class is described by its attributes; e.g. an organism can be specified by its name, strain, and genotype. The different types of associations are marked by different arrow-heads. Unfilled rhombi denote aggregations, meaning that the class pointed to contains the other class (e.g. a cultivation contains an organism). In contrast to an aggregation, where the contained class can stand on its own, classes in compositions (filled rhombi) are dependent on the class they refer (point) to (e.g. a growth condition depends on a cultivation and will be deleted if the corresponding cultivation is removed from the database, while the used organism remains). The sum of all classes and their associations makes up the model for the cultivation. Together with other models, e.g. for 2D gel image data, mass spectrometry, and MS data analysis, these models form the basis of the object-oriented data schema.

Table 1
Comparison of manual Mascot searches to automated searches with ProDB Tools: average time (min) needed to perform a search (a) with a spectrum

                        Clear (b) spectrum   Ambiguous (c) spectrum   All spectra
Inexperienced users     4 ± 3                7 ± 2.5                  5 ± 3
Experienced users       4 ± 2                6 ± 2                    5 ± 2
ProDB Tools (d)         1.75 ± 0.25          1.75 ± 0.25              1.75 ± 0.25

(a) Using Mascot against a database of 3540 predicted CDS from C. glutamicum (Kalinowski et al., 2003).
(b) A spectrum yielding a significant hit according to Mascot.
(c) A spectrum yielding no significant hit even after intensive search.
(d) Using 36 different parameter sets.

Fig. 6. The ProDB Tools interface. To analyze one or more mass spectra, the user has to select them from one or more experiments. In addition, ranges for the search parameters are set (e.g. peptide tolerance, databases, fixed modifications). From these ranges, the underlying module ProDB Tools creates the different parameter sets which are then used to perform the actual searches.


3. Results

ProDB is a platform for the evaluation and archival of mass spectra from proteome experiments together with their experimental setup. It facilitates the evaluation by integrating different search engines. By archiving the experimental setup, the mass spectra, and the results of the analysis, the system enables the user to compare different experiments and to mine the data for new knowledge. For this purpose, a well-defined data schema covering all steps of a proteome experiment (Fig. 1) was developed and described with UML (http://www.omg.org/uml/), as shown for the cultivation in Fig. 5.

A central part of the system is the automation of database searches and the subsequent storage and visualization of the results. Instead of manually changing the parameters and resubmitting the query to the search engine, the user can select parameter sets which are then processed automatically (Fig. 6). As a sample application we chose a set of 125 spectra generated during the analysis of the Corynebacterium glutamicum secretome. Using Mascot, the spectra were processed with 36 different parameter combinations and the results were visualized and exported as HTML tables (Fig. 4).
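To illustrate how ranges of search parameters can be expanded into individual parameter sets for such a batch run, the following Perl sketch builds the Cartesian product of a few example ranges. The parameter names and values are invented for this illustration and merely chosen so that, as in the experiment described above, 36 combinations result; the actual ProDB Tools implementation and parameters may differ.

```perl
# Sketch: expand parameter ranges into individual parameter sets
# (example values only; 3 * 2 * 2 * 3 = 36 combinations).
use strict;
use warnings;

my %ranges = (
    tolerance       => [ 0.1, 0.2, 0.5 ],                  # peptide tolerance (Da)
    missed_cleavage => [ 0, 1 ],                           # allowed missed cleavages
    fixed_mods      => [ 'none', 'Carbamidomethyl (C)' ],
    variable_mods   => [ 'none', 'Oxidation (M)', 'Deamidation (NQ)' ],
);

# Build the Cartesian product of all ranges.
my @sets = ( {} );
for my $param (sort keys %ranges) {
    my @expanded;
    for my $set (@sets) {
        push @expanded, { %$set, $param => $_ } for @{ $ranges{$param} };
    }
    @sets = @expanded;
}

printf "generated %d parameter sets\n", scalar @sets;
# each element of @sets can now be submitted as one search, e.g. over HTTP
```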

We have compared manual analysis by ten users, two of them experienced (daily users) and eight inexperienced (students), using Mascot, with automated analysis by ProDB. For this purpose, each user had to evaluate his own set of 30 spectra created during the analysis of different C. glutamicum proteome experiments. ProDB Tools was started with a set of 36 parameter combinations using the same samples as the users. Table 1 shows the results of this comparison. Using ProDB Tools, a significant speedup can be achieved. Furthermore, we compared the maximum scores found by the users to those found by the batch processing. The automated approach found a better score in about 43% of all cases (127 of 300 samples in total), yielding a significant hit in 23 instances where the users had failed, and the same score otherwise.

The integration of genome and transcriptome data is realized via the BRIDGE system, using ProDB, GenDB, and EMMA as plugins. The system is a significant improvement compared to the classic evaluation of mass spectra. In contrast to manual protein identification with search engines, the system allows an automated analysis with ProDB Tools. This is less time consuming and increases the quality of the evaluation. Furthermore, different evaluation strategies and tools can be compared, once again increasing the chance of biologically “relevant” identifications.

4. Discussion

The ProDB system integrates the analysis and storage of mass spectra with a detailed description of the experimental setup of the proteome experiments and enables the user to mine proteome data. For the comparison, verification, and exchange of proteome data with the community, there is a need for a standard data representation. The Proteomics Standards Initiative (PSI; http://psidev.sourceforge.net) (Orchard et al., 2003) was founded to define those standards. The Proteomics Experiment Data Repository (PEDRo) (Taylor et al., 2003) is a good approach in this direction. Similar to the model used for ProDB, the recently proposed PEDRo schema is a large, explicit model of proteomics experiment data. It covers sample generation, sample processing (gels and digestion), mass spectrometry, and MS results analysis. Comparing the two schemata, mass spectrometry and the analysis of results are modeled similarly, but the description of sample generation is more precise in ProDB. To ensure the compatibility between PEDRo and ProDB data, we will incorporate the details of the PEDRo schema which are missing in the ProDB model.

Acknowledgements

The authors want to thank C. Eck for the time spent doing the manual database searches needed for the comparison. Furthermore, the authors thank the anonymous reviewers for their detailed and very useful comments.

References

Anderson, N.L., Matheson, A.D., Steiner, S., 2000. Proteomics: applications in basic and applied biology. Curr. Opin. Biotechnol. 11, 408–412.
Blüggel, M.G.K., Glandorf, J., Vagts, J., Reinhardt, R., Chamrad, D., Thiele, H., 2002. ProteinScape: an integrated bioinformatics platform for proteome analysis. Technical Report, Bruker Daltonics.
Chakravati, D.N., Chakravati, B., Moutsatsos, I., 2002. Informatic tools for proteome profiling. Comput. Proteomics Suppl. 32, S4–S15.
Chalmers, M.J., Gaskell, S.J., 2000. Advances in mass spectrometry for proteome analysis. Curr. Opin. Biotechnol. 11, 384–390.
Chen, T., Kao, M.-Y., Tepel, M., Rush, J., Church, G.M., 2001. A dynamic programming approach to de novo peptide sequencing via tandem mass spectrometry. J. Comput. Biol. 8, 325–337.
Creasy, D., Cottrell, J., 2002. Error tolerant searching of uninterpreted tandem mass spectrometry data. Proteomics 2, 1426–1434.
Dančík, V., Addona, T.A., Clauser, K.R., Vath, J.E., Pevzner, P.A., 1999. De novo peptide sequencing via tandem mass spectrometry. J. Comput. Biol. 6, 327–342.
Dondrup, M., Goesmann, A., Bartels, D., Kalinowski, J., Krause, L., Linke, B., Rupp, O., Sczyrba, A., Pühler, A., Meyer, F., 2003. EMMA: a platform for consistent storage and efficient analysis of microarray data. J. Biotechnol. 106, 135–146.
Eng, J.K., McCormack, A.L., Yates, J.R., 1994. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Am. Soc. Mass Spectrom. 5, 976–989.
Fenyö, D., 2000. Identifying the proteome: software tools. Curr. Opin. Biotechnol. 11, 391–395.
Fleischmann, R., Adams, M., White, O., et al., 1995. Whole-genome random sequencing and assembly of Haemophilus influenzae. Science 269, 496–512.
Goesmann, A., Linke, B., Rupp, O., Krause, L., Bartels, D., Dondrup, M., McHardy, A.C., Wilke, A., Pühler, A., Meyer, F., 2003. Building a BRIDGE for the integration of heterogeneous data from functional genomics into a platform for systems biology. J. Biotechnol. 106, 157–167.
Kalinowski, J., Bathe, B., Bartels, D., Bischoff, N., Bott, M., Burkovski, A., Dusch, N., Eggeling, L., Eikmanns, B.J., Gaigalat, L., Goesmann, A., Hartmann, M., Huthmacher, K., Krämer, R., Linke, B., McHardy, A.C., Meyer, F., Möckel, B., Pfefferle, W., Pühler, A., Rey, D.A., Rückert, C., Rupp, O., Sahm, H., Wendisch, V.F., Wiegräbe, I., Tauch, A., 2003. The complete Corynebacterium glutamicum ATCC 13032 genome sequence and its impact on the production of L-aspartate-derived amino acids and vitamins. J. Biotechnol. 104, 5–23.
Linke, B., November 2002. O2DBI II—ein Persistenz-Layer für Perl-Objekte. Diploma Thesis, Technische Fakultät, Universität Bielefeld.
Meyer, F., Goesmann, A., McHardy, A.C., Bartels, D., Bekel, T., Clausen, J., Kalinowski, J., Linke, B., Rupp, O., Giegerich, R., Pühler, A., 2003. GenDB—an open source genome annotation system for prokaryote genomes. Nucl. Acids Res. 31 (8), 2187–2195.
Orchard, S., Hermjakob, H., Apweiler, R., 2003. The proteomics standards initiative. Proteomics 3 (7), 1374–1376.
Pappin, D.J.C., Hojrup, P., Bleasby, A.J., 1993. Rapid identification of proteins by peptide-mass fingerprinting. Curr. Biol. 3 (6), 327–332.
Perkins, D.N., Pappin, D.J.C., Creasy, D.M., Cottrell, J.S., 1999. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20 (18), 3551–3567.
Rice, P., Longden, I., Bleasby, A., 2000. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 16 (6), 276–277.
Taylor, J., Johnson, R., 2001. Implementation and uses of automated de novo peptide sequencing by tandem mass spectrometry. Anal. Chem. 73 (11).
Taylor, C.F., Paton, N.W., Garwood, K.L., Kirby, P.D., Stead, D.A., Yin, Z., Deutsch, E.W., Selway, L., Walker, J., Riba-Garcia, I., Mohammed, S., Deery, M.J., Howard, J.A., Dunkley, T., Aebersold, R., Kell, D.B., Lilley, K.S., Roepstorff, P., Yates, J.R., Brass, A., Brown, A.J., Cash, P., Gaskell, S.J., Hubbard, S.J., Oliver, S.G., 2003. A systematic approach to modelling, capturing, and disseminating proteomics experimental data. Nat. Biotechnol. 21, 247–254.
Zhang, W., Chait, B.T., 2000. ProFound: an expert system for protein identification using mass spectrometric peptide mapping information. Anal. Chem. 72 (11), 2482–2489.
Zhang, Z., McElvain, J.S., 2000. De novo peptide sequencing by two-dimensional fragment correlation mass spectrometry. Anal. Chem. 72, 2337–2359.

Orchard, S., Hermjakob, H., Apweiler, R., 2003. The proteomics standards initiative. Proteomics 3 (7), 1374–1376. Pappin, D.J.C., Hojrup, P., Bleasby, A.J., 1993. Rapid identification of proteins by peptide-mass fingerprinting. Curr. Biol. 3 (6), 327–332. Perkins, D.N., Pappin, D.J.C., Creasy, D.M., Cottrell, J.S., 1999. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20 (18), 3551–3567. Rice, P., Longden, I., Bleasby, A., 2000. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 16 (6), 276–277. Taylor, J., Johnson, R., 2001. Implementation and uses of automated de novo peptide sequencing by tandem mass spectrometry. Anal. Chem. 73 (11). Taylor, C.F., Paton, N.W., Garwood, K.L., Kirby, P.D., Stead, D.A., Yin, Z., Deutsch, E.W., Selway, L., Walker, J., Riba-Garcia, I., Mohammed, S., Deery, M.J., Howard, J.A., Dunkley, T., Aebersold, R., Kell, D.B., Lilley, K.S., Roepstorff, P., Yates, J.R., Brass, A., Brown, A.J., Cash, P., Gaskell, S.J., Hubbard, S.J., Oliver, S.G., 2003. A systematic approach to modelling, capturing, and disseminating proteomics experimental data. Nat. Biotechnol. 21, 247–254. Zhang, W., Chait, B.T., 2000. ProFound: an expert system for protein identification using mass spectrometric peptide mapping information. Anal. Chem. 72 (11), 2482–2489. Zhang, Z., McElvain, J.S., 2000. De novo peptide sequencing by two-dimensional fragment correlation mass spectrometry. Anal. Chem. 72, 2337–2359.