Use of mass spectrometry-derived data to annotate nucleotide and protein sequence databases


Review


TRENDS in Biochemical Sciences Vol.26 No.1 January 2001


Matthias Mann and Akhilesh Pandey

Mass spectrometry-based proteomic methodologies can be used to annotate both nucleotide and protein sequence databases. Because such data have to be derived from proteins, they can be used to identify coding regions of the genome as well as provide the complete primary sequence of proteins and their expression patterns and post-translational modifications.

Matthias Mann
Protein Interaction Laboratory (PIL), Center for Experimental Bioinformatics, University of Southern Denmark, Campusvej 55, Odense M, DK-5230 Denmark; and MDS-Protana, Staermosegaardsvej 6, Odense M, DK-5230 Denmark. e-mail: [email protected]

Akhilesh Pandey
Whitehead Institute for Biomedical Research, Nine Cambridge Center, Cambridge, MA 02142; and Brigham and Women’s Hospital, Boston, MA 02115, USA. e-mail: [email protected]

Complete sequences of several genomes, including Saccharomyces cerevisiae, Caenorhabditis elegans and Drosophila melanogaster, as well as a draft of the human genome, have already been reported (Refs 1–3). This has led to an information explosion, and we now have more data available to us than we can manage (Box 1). One way of handling these data is to use automatic annotation programs such as Ensembl, a project launched jointly by EMBL-EBI and the Sanger Centre (Ref. 4). The Ensembl system is fed raw genomic data and automatically assembles the DNA fragments, predicts genes, and identifies single nucleotide polymorphisms (SNPs), repeat regions and regions homologous to other sequences in public databases. However, manual confirmation to check for errors would require an estimated 300 person-years of annotation of the human genome at the sequence level (Ref. 5). The need for more tools to handle such data systematically is obvious.

For example, take the issue of the total number of genes in the human genome. For years it was presumed to be approximately 80 000–100 000. It was expected that once the human genome was fully sequenced, it would be trivial to calculate the exact number of genes. Paradoxically, as we approach the finishing line with respect to sequencing, several groups have re-estimated the total number of genes, and their latest estimates range from 35 000 to 120 000 (Refs 6,7). Where exactly did we go wrong? Given that the prediction accuracy of gene prediction programs is approximately 40%, it is easy to see where we might have erred (see results of the Drosophila Genome Annotation Assessment Project in Ref. 8). The mouse sequencing project, whose data will be available to the public this year, is expected to help tremendously in annotating coding regions in the human genome. It should be noted that prediction of genes is still a much simpler task than predicting, for example, a phosphorylation site or the function of a domain or of a protein itself. Bork has come to an interesting conclusion about most of the bioinformatics tools in use today: they have difficulty exceeding a prediction accuracy of 70% (Ref. 9). Viewed differently, this amounts to incorrect predictions ~30% of the time, which can be time consuming to address.

http://tibs.trends.com 0968-0004/01/$ – see front matter © 2001 Elsevier Science Ltd. All rights reserved. PII: S0968-0004(00)01726-6


Box 1. Major primary sequence databases

• Nucleotide sequence databases of known/predicted proteins (e.g. GenBank and EMBL)
Contain nucleotide sequences of all cloned and well-characterized genes. Increasingly, these databases contain cDNA sequences corresponding to proteins that have not yet been characterized (generally deposited by cDNA sequencing consortia). A recent inclusion is proteins that are ‘totally hypothetical’ because they are generated by artificially ‘stitching together’ the exons that are predicted from the genomic DNA by gene-finding programs (this information must be used with extreme caution). GenBank includes ‘finished’ high-throughput genomic sequences in this category.

• Protein sequences of known proteins (e.g. SWISS-PROT)
Contain protein sequences of all cloned and characterized proteins. In addition, they contain translations of open reading frames of cDNAs that are predicted to code for proteins. The level of annotation of these depends upon the particular database being used. Curated databases created by human reading of the literature and subsequent entry into a database also exist (e.g. the yeast protein database; YPD).

• Database of expressed sequence tags (ESTs) (e.g. dbEST)
Contain single-pass nucleotide sequence reads from 5′ and 3′ ends of cDNA inserts.

• Genomic databases (e.g. high-throughput genomic sequence; HTGS)
Contain nucleotide sequences from genomic DNA obtained by a variety of strategies. HTGS contains ‘unfinished’ genomic sequence deposited by major sequencing centers.

Box 2. Limitations of EST databases

• Short stretches of cDNAs
They might not contain coding sequences and generally do not contain the full coding sequence.

• Sequencing errors
These limit the use of ESTs in establishing the reading frame that codes for the protein. Often we have found that the correct protein sequence must be obtained by ‘stitching together’ the putative translation products from all three reading frames of the EST.

• Untranslated regions
Because the ESTs are biased towards the 3′ end of the gene, they might only contain the 3′ UTR, precluding their use in searches based on protein homology.

• Abundance
The occurrence of most ESTs correlates somewhat with the abundance of the mRNA of the given gene. Thus transcripts of medium-to-low abundance might not be present in the EST databases.

• Difficulty in alignment
It can be difficult to align ESTs derived from the same gene because of alternative splicing, or because of gaps in the sequence that arise when only the ends of cDNAs are sequenced. This is one of the reasons why the number of unique EST clusters might represent fewer genes than the number of clusters.

Expressed sequence tag (EST) databases can be used with caution to derive useful information but still suffer from certain drawbacks (Box 2). In addition, genomic databases have their own limitations (Box 3). One way to remedy this predicament is to analyze the proteins directly. To tackle the genome, much attention has been paid to bioinformatics-based tools, and the proteome has largely been ignored. The term ‘proteomics’ denotes any large-scale analysis of proteins and has been applied to studies ranging from identification of spots from 2D electrophoretic gels to characterization of protein–protein interactions (Refs 10,11). Proteomics has been greatly facilitated by recent advances in biological mass spectrometry (MS), which has become the method of choice for protein characterization (Box 4). The major advantage of proteomic analysis is that the findings are more bona fide, in the sense that they reflect physical measurements of actual proteins and are not an extrapolation (such as an assertion that a sequence contains potential phosphorylation sites). Therefore, the data derived from such studies are an excellent and necessary complement to information gathered from computational approaches.

Consequently, in the post-genomic era, mass spectrometry assumes one more important role: that of facilitating annotation of genomic as well as EST and protein sequence databases. These annotations are either impossible or difficult to predict using nucleotide sequence information alone; hence the analysis of proteins is a prerequisite. The analysis of proteins using mass spectrometry is generally associated with specific biological experiments. For example, inherent in the identification of a phosphorylation site in a peptide is the information that this protein was phosphorylated upon treatment of a specific cell line with a specified growth factor for a given time. See Fig. 1 for a representation of the various scenarios in which mass spectrometric analysis interfaces with biological experiments.

Prediction of exons from genomic databases

Until now, identification and characterization of proteins was difficult for a variety of reasons. First, Edman sequencing, which was commonly used for identification, required large amounts of purified protein [Science’s Signal Transduction Knowledge Environment (STKE): http://www.stke.org/cgi/content/full/OC_sigtrans;2000/37/pl1]. Second, even when such amounts were available, the relevant proteins could often not be identified because of a blocked N terminus. Mass spectrometry does not suffer from these drawbacks: it is a sensitive method for identifying proteins (femtomole levels of proteins from gels; Ref. 12), can analyze mixtures (therefore, the protein does not have to be pure) and is not affected by a blocked N terminus. Spots excised from 2D gels, bands excised from 1D gels or a complex mixture of proteins in solution can be digested by proteases such as trypsin and the peptides sequenced by tandem mass spectrometry. These peptide sequences can then be used to search the genomic databases (Ref. 13) to identify coding regions.

Although ESTs have been used successfully to identify novel proteins (Refs 13,14), it was found that the use of EST data alone does not improve overall gene prediction (Refs 8,15). Some of the factors that contributed to this were problems with alignment, the occurrence of paralogs and the low quality of EST sequences. In addition, ESTs led to an increased prediction of smaller genes, thereby reducing specificity (Ref. 15). However, when any data based on proteins were included, a better prediction accuracy was obtained (Ref. 15). This observation makes mass spectrometry-derived peptide sequence data even more valuable as a complement to bioinformatics tools for gene prediction. Because different gene prediction programs predict different exon–intron structures, such peptide sequences will help establish which ones are ‘real’.

Of course, this method cannot normally be used to predict every exon, as it is difficult to achieve 100% coverage for every protein by mass spectrometry at the low levels usually available in biological experiments. The major reasons for this are: (i) some of the peptides derived from a given protein might not ionize well in the mass spectrometer and thus result in very small peaks; (ii) the peptides might be too large or too small for analysis; or (iii) the peptides might give rise to complex fragmentation spectra that are hard to interpret. Nevertheless, by using two different enzymes, close to 100% of the protein primary sequence can often be verified if sufficient material is present. Besides prediction of coding exons, it is possible to find alternatively spliced variants at the protein level based on peptide sequences that ‘span’ different exons in the genome. Several instances showing how mass spectrometry-derived data can be used to annotate genomic sequence are shown in Fig. 2.

For genomic annotation we propose to use a special format of mass spectrometry in which crude protein mixtures are first reduced to peptides and the peptides are then fed into the mass spectrometer after 1D or 2D chromatography (Refs 16–19). Thousands of peptides can be sequenced in this approach in a fully automated fashion and with high sensitivity. We then suggest using database search software that can directly find the sequenced peptide in genomic databases, without the need for translation into the six possible reading frames (B. Küster et al., unpublished). This analysis can be done on relatively large amounts of starting material, enabling the analysis of low-copy-number proteins.

Box 3. Limitations of genomic databases

• Interrupted coding regions
Because the coding exons in most eukaryotes are interrupted by introns, the prediction of coding regions is not straightforward. Human genes contain the most introns and the largest introns of most mammals studied, making precise gene predictions especially difficult (Ref. 42).

• No reading frame
Because the intron–exon junctions do not have any relation to codons, there is no continuous open reading frame as is seen in cDNAs. This is because splitting of exons by introns does not preserve reading frames.

• Non-directional
The strand of genomic DNA that codes for a gene is not fixed. Genes can be transcribed from both strands in opposite directions.

• Repeat elements
The presence of repeated elements can create serious problems in alignment and analysis. Repetitive sequences are excluded when an ‘effective size’ of the human genome is calculated.

Low-molecular-weight proteins

A major subset of proteins that are unlikely to be identified from the genomic sequence by gene prediction alone are those containing fewer than 100 amino acids (Ref. 20). This is because gene-prediction algorithms deliberately avoid predicting such short genes by imposing a minimum gene size of ~100 amino acids. If such a limit is not imposed, the number of false positives (i.e. the number of potential coding regions that are actually noncoding) increases sharply. As a result, these proteins will have to be identified by alternative methods. A major fraction of these small polypeptides are likely to be cytokines and growth factors that are secreted by cells; such proteins can be enriched from supernatants of cell lines or obtained from body fluids such as serum, urine or cerebrospinal fluid. Thus, direct sequencing of small secreted proteins by mass spectrometry will allow the discovery of small open reading frames (ORFs) in the genome. Another method is to use cDNA-based techniques to specifically isolate secreted molecules, such as the signal-sequence trap method (Ref. 21), or other generic methods that isolate less abundant cDNAs, such as serial analysis of gene expression (SAGE; Ref. 22).
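The reason for the ~100 amino acid cutoff can be illustrated numerically. The following is a minimal sketch, not any gene-prediction program's actual method: it scans random (hence entirely noncoding) DNA for ATG-to-stop ORFs on both strands and counts how many fall below and above a length cutoff. The function names, the 30 kb sequence length and the random seed are illustrative assumptions.

```python
"""Count chance ORFs in noncoding (random) DNA below and above a length cutoff."""
import random

STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def orf_lengths(dna):
    """Lengths in codons (ATG included, stop excluded) of ORFs in all six frames."""
    lengths = []
    for strand in (dna, reverse_complement(dna)):
        for frame in range(3):
            start = None  # position of the first ATG since the last stop codon
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if codon == "ATG" and start is None:
                    start = i
                elif codon in STOPS and start is not None:
                    lengths.append((i - start) // 3)
                    start = None
    return lengths


def count_by_cutoff(dna, min_codons=100):
    """Return (number of ORFs shorter than the cutoff, number at or above it)."""
    lengths = orf_lengths(dna)
    short = sum(1 for n in lengths if n < min_codons)
    return short, len(lengths) - short


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    noise = "".join(rng.choice("ACGT") for _ in range(30_000))
    short, long_ = count_by_cutoff(noise)
    print(f"random DNA: {short} ORFs < 100 codons, {long_} ORFs >= 100 codons")
```

In random sequence almost all chance ORFs are short, so lowering the cutoff admits many spurious candidates. This is why small proteins are better confirmed by direct peptide evidence than by prediction alone.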


Box 4. Analysis of proteins by mass spectrometry

• Peptide sequencing by tandem mass spectrometry
Proteins are generally digested by trypsin to generate peptides that are then sequenced by mass spectrometry. The method used to sequence peptides is termed tandem mass spectrometry, in which the peptide to be sequenced is first selected from the entire peptide mixture and then fragmented by collision with an inert gas to obtain a partial or complete sequence.

• Database searching using mass spectrometry-derived data
Mass spectrometry of tryptic digests by matrix-assisted laser desorption/ionization time of flight (MALDI-TOF) provides a highly accurate list of peptide masses. This list of peptide masses is used to search a theoretical tryptic digest of all the proteins in the databases to identify the protein. The data obtained by tandem mass spectrometry (MS/MS) generally comprise the partial sequence of a peptide. This partial sequence is referred to as a ‘peptide sequence tag’. It generally contains enough information to identify the protein from databases as large as the human genomic database. If a peptide match for a given peptide sequence tag is not found, it might mean that the peptide belongs to an unknown protein or that the peptide is modified (see below).

• Quantitation of protein levels
Proteins or peptides are difficult to quantify in any mass spectrometer. However, if two samples are being compared, relative quantitation can be obtained by modifying the peptides in both samples (such as by adding a ‘tag’ of known molecular mass). One of the samples is modified with a tag containing the naturally occurring isotope and the other with a stable isotope analog, such as a deuterium-labeled tag, and the samples are mixed before analysis. In this case, the corresponding peptides derived from the two samples can be identified easily because they have a constant difference in mass, which is the mass of the stable isotope substitution. The difference in peak intensities then correlates directly with the relative abundance of the protein in the two samples.

• Identifying phosphopeptides and phosphorylation sites
The peptides containing phosphates can be identified either by observing a change in molecular weight upon treatment with a phosphatase or, in a mass spectrometer, by looking for a reporter ion that is characteristic of phosphate groups (parent or precursor ion scanning). Identification of phosphorylation sites can be done in several ways by mass spectrometry. In a triple-quadrupole or quadrupole time-of-flight mass spectrometer, the location of phosphoserines or phosphothreonines can be identified in the positive mode if the mass difference between two successive fragments corresponds to phosphorylated (addition of HPO3 or 80 Da) serine residues (87 + 80 Da) or to phosphorylated threonine residues (101 + 80 Da). These residues can also be identified by a loss of 98 Da resulting from a β-elimination reaction. Phosphotyrosines are generally more stable, do not undergo the β-elimination reaction and are thus found only as a difference of 163 + 80 Da between two successive fragments. The masses of serine, threonine and tyrosine residues are 87, 101 and 163 Da, respectively. Fragmentation to obtain sequence can also be done in an ion-trap mass spectrometer, although the fragmentation pattern is slightly different from that described above.

• Other post-translational modifications
All modifications cause a shift in mass and therefore, in principle, are detectable by mass spectrometry. For example, the N terminus of a protein is indicated by an acetylated peptide; in this case, the N-terminal amino acid contained within the peptide is 42 Da greater than expected. Similarly, myristoylation causes an increase of 210 Da, and GalNAc or GlcNAc sugars increase the mass of the corresponding peptide by 203 Da.

[Figure 1 panel labels: Growth factor (No treatment; + TGF-β). Immunoprecipitation with antibodies. Precipitation with control and ‘bait’ proteins. Quantitation of protein levels (Condition A; Condition B). Western blotting with modification-specific antibody (e.g. anti-phosphoserine antibody). 1D gel; 2D gel. Excise bands; excise spots; excise spots specifically detected by modification-specific antibody; excise differentially expressed spots; excise bands that specifically associate with ‘bait’ protein. In-solution digest of crude protein mixtures followed by liquid chromatography (LC) separation of peptides. Mass spectrometric analysis. Database searching. Annotation. Sequence databases.]

Fig. 1. Overview of various types of experiments that can be coupled to mass spectrometric analysis. The data that can be used to annotate databases include information on the ‘upfront’ experiment (such as the growth factor used or the antibody used in the immunoprecipitation step), as well as the data obtained by mass spectrometry. Abbreviation: TGF-β, transforming growth factor β.

Initiation codon

The translational start sites in mRNAs are not well defined. When the mRNA is translated, the ribosomes generally choose the first set of nucleotides encoding a methionine (AUG). The sequences around these AUG codons have been proposed to affect the efficiency of translation from a particular potential start site; these sequences have been formulated into a consensus sequence for translation initiation (Kozak’s consensus; Ref. 23). However, most AUGs (ATGs in cDNA) are not in complete, or sometimes any, agreement with the consensus Kozak sequence. This means that the initiator ATG that will be translated for a given gene might have to be determined experimentally or by other methods. These could involve direct sequencing of the N terminus of the protein, prediction of the initiator methionine by homology to known proteins, or a guess based on the presence of multiple stop codons, or the absence of other ATGs, upstream of a suspected initiator ATG. A wrong choice means that either a longer protein is predicted or, conversely, a shorter version of the protein is predicted when, in reality, an upstream ATG is being used. Using mass spectrometry, it is possible to obtain peptides that are acetylated at the N terminus, which establishes them as N-terminal peptides (Ref. 24). Another modification that can distinguish an N-terminal peptide from internal peptides is the fairly frequent cleavage of the initiator methionine by aminopeptidases (Ref. 25). In the post-genomic era, one obvious advantage of recognizing the initiator methionine is that the translational start of a gene can then be deduced from the corresponding entry in the genome database.

Signal peptides

Signal peptides are hydrophobic stretches of amino acids that are found at the N termini of most membrane-bound receptors as well as soluble ligands. Although computer programs can predict roughly where the cleavage of this signal sequence occurs (Refs 26–28), the exact site of cleavage must often be determined experimentally. Most experiments in biological mass spectrometry involve cleavage of the protein by proteases such as trypsin before analysis. Owing to the cleavage specificity of trypsin, this results in peptides that contain an arginine or a lysine residue at their C termini. This implies that when the sequence of a peptide is determined, the residue in the protein that immediately precedes the peptide must be an arginine or a lysine. An exception is a peptide derived from the N terminus of the protein. In the case of secreted proteins, the signal peptide is cleaved and the ‘mature’ protein is found in the extracellular environment. Therefore, when a secreted protein is being analyzed and a peptide that corresponds to the N terminus is found, the site of cleavage is automatically assigned. This information is especially critical when recombinant growth factors are manufactured for human use in certain heterologous expression systems, because inclusion of amino acids that are actually cleaved in the mature protein might elicit an immune response.

Prediction of correct reading frame and stop codon
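The logic of this section, using a sequenced peptide to fix the reading frame of an EST and to recognize a translational stop, can be sketched in a few lines. This is an illustrative sketch only: it uses plain six-frame translation with the standard genetic code, which is not the direct genomic search software mentioned earlier, and the function names and the toy EST are invented for the example.

```python
"""Locate a peptide in the six translated frames of an EST (toy example)."""
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from position 0; '*' marks a stop, 'X' an unknown codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def locate_peptide(est, peptide):
    """Return (strand, frame, nucleotide offset on that strand, followed_by_stop)."""
    hits = []
    for strand, seq in (("+", est), ("-", reverse_complement(est))):
        for frame in range(3):
            protein = translate(seq[frame:])
            pos = protein.find(peptide)
            while pos != -1:
                end = pos + len(peptide)
                hits.append((strand, frame, frame + 3 * pos,
                             protein[end:end + 1] == "*"))
                pos = protein.find(peptide, pos + 1)
    return hits


# Hypothetical EST: two leading bases, codons for MKTAYIAKQLSF, a stop, two trailing bases.
EST = "GG" + "".join(["ATG", "AAA", "ACC", "GCT", "TAT", "ATT", "GCG",
                      "AAG", "CAG", "CTG", "TCT", "TTT", "TAA"]) + "CC"

if __name__ == "__main__":
    for peptide in ("TAYIAK", "QLSF"):
        non_tryptic_end = peptide[-1] not in "KR"
        print(peptide, locate_peptide(EST, peptide),
              "candidate C terminus" if non_tryptic_end else "internal tryptic peptide")
```

In the toy data, the peptide that does not end in lysine or arginine is immediately followed by a stop codon in its frame, which is the pattern that would mark the C terminus of the protein and the stop site of the clone.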

If a very short EST sequence is present in databases, it is not immediately clear whether it contains any open reading frame (coding region) and, if it does, what that reading frame is. If a peptide sequence obtained by mass spectrometry is found to correspond to an EST sequence, it can be used to assign the correct reading frame to that clone. Further, as explained above, a tryptic peptide sequence that does not end in an arginine or lysine suggests that it is derived from the C terminus of the protein under study. In a database search, this would align with a region just upstream of a stop codon, thus establishing the reading frame as well as the translational stop site within that clone.

[Figure 2: schematic comparing a real protein (exons 1–8, with 5′ UTR and 3′ UTR) and the genomic DNA with the exon sets output by Gene prediction programs 1, 2 and 3. Key: exon; intron; peptide sequence; acetylated N-terminal peptide; phosphorylated peptide (P); glycosylated peptide.]

Fig. 2. Outputs of gene prediction programs compared with a real protein. The figure shows how mass spectrometry-derived data can be used for annotation of genomic databases. Any tryptic peptide whose sequence matches a genomic region localizes a coding exon. In addition, some peptides might be sequenced that ‘span’ two exons in the genome, thereby predicting two exons adjacent to each other (exons 6 and 7). A peptide with an acetylated N terminus or corresponding to a cleaved methionine indicates the N terminus of the protein. Similarly, a peptide that does not contain an arginine or lysine at its C terminus might indicate the C terminus of the protein.

Post-translational modifications
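Box 4 lists the nominal mass shifts produced by common modifications. As a minimal sketch of how such shifts might be matched against an observed mass difference, the following uses only the integer values quoted in Box 4; the function names and the 0.5 Da tolerance are assumptions made for illustration, and a real search would use exact monoisotopic masses.

```python
"""Match observed mass differences to the nominal shifts quoted in Box 4."""

# Nominal mass shifts (Da) as given in Box 4.
MOD_SHIFTS = {
    "phosphorylation (HPO3)": 80,
    "N-terminal acetylation": 42,
    "myristoylation": 210,
    "GalNAc/GlcNAc": 203,
}

# Nominal residue masses (Da) of the phosphorylatable residues, from Box 4.
RESIDUE_MASS = {"S": 87, "T": 101, "Y": 163}
PHOSPHATE = 80


def explain_shift(expected_mass, observed_mass, tol=0.5):
    """Return the names of modifications whose shift matches observed - expected."""
    delta = observed_mass - expected_mass
    return [name for name, shift in MOD_SHIFTS.items() if abs(delta - shift) <= tol]


def phospho_residue(fragment_gap, tol=0.5):
    """Name the phosphorylated residue implied by the gap between two successive fragments."""
    for residue, mass in RESIDUE_MASS.items():
        if abs(fragment_gap - (mass + PHOSPHATE)) <= tol:
            return "p" + residue
    return None


if __name__ == "__main__":
    print(explain_shift(1000.0, 1080.0))  # peptide 80 Da heavier than expected
    print(phospho_residue(167))           # 87 + 80 Da gap between fragments
```

The first function works at the level of the whole peptide, the second at the level of a single residue within a fragment series, which mirrors the distinction drawn in this section between knowing that a peptide is modified and localizing the modified residue.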

The majority of proteins in cells undergo post-translational modifications. Extracellular and transmembrane proteins are frequently glycosylated, whereas intracellular signaling molecules can be basally or inducibly phosphorylated. It is thought that at any one time approximately one-third of all proteins in eukaryotic cells are phosphorylated (Ref. 29). Although it is more difficult to detect and analyze peptides that carry post-translational modifications, such modifications are frequently discovered during large-scale proteomics experiments (reviewed in Ref. 30). At the peptide level, it can be established that a peptide is phosphorylated, acetylated or glycosylated (Box 4). However, it is more difficult to localize the exact residue that is modified, and in many instances it might not be possible at all. Nevertheless, knowledge of the peptide carrying the modification might be adequate for the biologist even if the exact residue of modification is not located. Because mass spectrometry provides information about the exact mass of the post-translational modification in addition to the peptide sequence, such information could be annotated in the databases for further use by the research community.

Expression data
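Box 4 describes relative quantitation by labeling two samples with light and heavy isotope tags, so that corresponding peptides appear as peak pairs separated by a constant mass difference. A minimal sketch of that pairing step follows; the peak list, the 8 Da tag difference and the function name are hypothetical values chosen for illustration, not data from any experiment.

```python
"""Pair light/heavy peptide peaks and report heavy-to-light intensity ratios."""


def pair_ratios(peaks, mass_shift, tol=0.02):
    """peaks: iterable of (mass, intensity).

    Returns (light_mass, heavy_mass, heavy/light ratio) for every pair of peaks
    whose masses differ by mass_shift within tol.
    """
    peaks = sorted(peaks)
    pairs = []
    for light_mass, light_intensity in peaks:
        for heavy_mass, heavy_intensity in peaks:
            matches = abs((heavy_mass - light_mass) - mass_shift) <= tol
            if matches and light_intensity > 0:
                pairs.append((light_mass, heavy_mass,
                              heavy_intensity / light_intensity))
    return pairs


# Hypothetical peak list: two labeled pairs and one unpaired peak.
PEAKS = [
    (1000.50, 200.0), (1008.50, 400.0),
    (1500.75, 90.0), (1508.75, 30.0),
    (1234.60, 50.0),
]

if __name__ == "__main__":
    for light, heavy, ratio in pair_ratios(PEAKS, mass_shift=8.0):
        print(f"{light:.2f} / {heavy:.2f}: heavy-to-light ratio {ratio:.2f}")
```

A ratio above 1 would indicate that the peptide, and by inference its protein, is more abundant in the heavy-labeled sample; peaks without a partner are simply not quantified.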

When a report on a novel gene is published, there is usually some information on tissue expression by northern blotting to detect mRNA. Less commonly, the cell lines that express that protein are described. For investigators who are interested in studying that protein, it is often difficult to find a resource listing cell lines that express the protein. Because the samples for many mass spectrometry-based identification experiments are derived from a cell line, once the identity of the protein is determined, the entries for these proteins could be annotated in sequence databases giving the cell line of origin as the source of protein. Interesting avenues of investigation might be initiated in light of such expression information. For example, if a cytokine receptor known to be specifically expressed in hematopoietic cells is found in fibroblasts, its role in the newly discovered environment can then be tested.

Newer technologies can quantify protein expression between two states using stable isotope incorporation into one of the states. Peak ratios between peptides with and without the isotope tag accurately quantify relative levels. In a special twist on this technique, using isotope-coded affinity tags (ICATs; Ref. 31), hundreds of proteins can be quantified in a single experiment using only the cysteine-containing peptides. This quantitation is an improvement compared with mRNA data because mRNA levels have been found to correlate poorly with protein expression and might not be predictive of protein levels (Ref. 32). Analogous to the publicly available data derived from mRNA expression patterns using microarrays, protein expression data could be linked to the protein sequence itself (see below). Finally, because a minority of pseudogenes can be transcribed (Ref. 33), detection of a transcript alone does not rule out a pseudogene; if a protein is detected for a gene, however, it can be presumed that the gene is not a pseudogene.

Protein localization

Proteins, especially novel ones, are often assigned a subcellular localization based on homology to other proteins or by sophisticated computer programs (Ref. 34). However, these predictions are generally not reliable, and therefore the localization of a given protein must be confirmed by experimental methods. An interesting example is the S100 family of calcium-binding proteins, which are involved in the regulation of a plethora of intracellular activities such as protein phosphorylation, enzyme activities, cell proliferation and differentiation (Ref. 35). Based on their sequence, they are not predicted to be secreted molecules. However, some members of this family were discovered as secreted proteins, either as part of proteomic approaches to study secreted molecules or as components of saliva (Refs 36,37). Similarly, proteins that are predicted to be nuclear based on the presence of nuclear localization signals might turn out to be cytosolic and vice versa. Mass spectrometric identification of proteins in specific subcellular fractions can be used to assist in annotating proteins localized to organelles such as mitochondria (Ref. 38) or the chloroplast (Ref. 39).


Annotation with protein data – but how?

The submission of sequences of newly identified genes to databases is now routine. Standard web-based forms are to be filled out with the fields clearly defined by the agency that maintains the corresponding database. However, protein data cannot be submitted by users of GenBank, which is surprising given the amount of information being generated directly from proteins. This means that protein entries that are in the databases are largely ‘inferred’ from the cDNA or genomic sequences by conceptual translation or prediction. GenBank currently only allows the submitter of a given sequence to modify the data submitted. There is hardly any provision for third-party correction, let alone third-party annotation. Therefore, the data acquired using mass spectrometry (or by other proteomic methods) cannot be added to the sequence databases under the current circumstances.

Data derived from analysis of proteins are dependent upon several factors. For example, although identification of a secreted protein from a cell line could be added as a simple annotation of the protein, modifications such as phosphorylation would have to specify the exact experimental conditions (such as growth factor treatment being used). Besides, a vast amount of data can be accumulated on every protein. We favor an open system of annotation whereby the annotation is performed directly by the experimenter, as in a distributed annotation system being developed by Lincoln Stein (http://stein.cshl.org/das/). In this system, a single server is designated as a reference server that contains primary information about the DNA and protein sequences as well as authorship information. Several websites can then serve as annotation servers that provide additional information on the retrieved sequence. In this scenario, no attempt is made to resolve contradictions that exist between various annotation servers. The uniqueness of this

Acknowledgements

Work at the Protein Interaction Laboratory (PIL) is supported by a generous grant from the Danish National Research Foundation to the Center for Experimental BioInformatics (CEBI) at the University of Southern Denmark. A.P. is supported by a Howard Temin Award from the National Cancer Institute (KO1 CA75447). We thank members of the PIL, especially A. Podtelejnikov, H. Steen, J. Andersen, J. Rappsilber, S-E. Ong and B. Küster for helpful discussions. We thank G. Subramanian, K. Kristiansen and M. Tewari for critical reading of the manuscript.

References 1 Adams, M.D. et al. (2000) The genome sequence of Drosophila melanogaster. Science 287, 2185–2195 2 The C. elegans Sequencing Consortium (1998) Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282, 2012–2018 3 Goffeau, A. et al. (1996) Life with 6000 genes. Science 274, 546–567 4 Hubbard, T. and Birney, E. (2000) Open annotation offers a democratic solution to genome sequencing. Nature 403, 825 5 Claverie, J.M. (2000) Do we need a huge new centre to annotate the human genome? Nature 403, 12 6 Liang, F. et al. (2000) Gene index analysis of the human genome estimates approximately 120,000 genes. Nat. Genet. 25, 239–240 7 Ewing, B. and Green, P. (2000) Analysis of expressed sequence tags indicates 35,000 human genes. Nat. Genet. 25, 232–234 8 Reese, M.G. et al. (2000) Genome annotation assessment in Drosophila melanogaster. Genome Res. 10, 483–501

http://tibs.trends.com

system lies in the fact that the end-user can decide for him/herself which data to use. This would allow for true integration of data, unlike pointers to a weblink that contains related information on a particular sequence. One of the ways in which this could be accomplished would be to use systems similar to the software that is used to share music files over the internet40,41. To illustrate this example, the popular Napster software has a central database and server that allows the user base to link to information that still resides on the hard disks of a large user community. In a more decentralized model, the Gnutella software dispenses with the central server and connects the user community directly with each other. Applying either of these models to biology, the basic structure would be given by the genome and the gene names but massive amounts of information could be available dispersed over the internet. Obviously, highly sophisticated software would be needed to retrieve and integrate the disparate pieces of information. Conclusions and outlook

It is becoming clear that protein-derived data is essential for annotation of nucleic acid and protein sequence databases. Mass spectrometry as the proteomic tool of choice is poised to contribute greatly to the wealth of these databases. Besides the advantage of high-throughput analysis, mass spectrometry provides experimental data and is thus a perfect complement to other annotation efforts. An international initiative and collaboration is, however, critical for the long-term viability of any open annotation system. We envision that, like most other web-based initiatives (electronic publication being one of them), such a system of annotation will become popular and indispensable in the not too distant future.
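The open annotation model discussed in this review, in which one reference server holds the primary sequence data and independent annotation servers contribute additional information without any attempt to reconcile contradictions, could be sketched in Python roughly as follows. This is an in-memory toy under stated assumptions, not the actual distributed annotation system or its protocol: the classes `ReferenceServer` and `AnnotationServer`, the function `gather`, the accession `X0001`, the sequence fragment, and the server names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Set


@dataclass
class ReferenceServer:
    """Hypothetical stand-in for the server holding primary sequence records."""
    sequences: Dict[str, str] = field(default_factory=dict)

    def get_sequence(self, accession: str) -> str:
        return self.sequences[accession]


@dataclass
class AnnotationServer:
    """Hypothetical stand-in for one independently maintained annotation source."""
    name: str
    annotations: Dict[str, List[str]] = field(default_factory=dict)

    def get_annotations(self, accession: str) -> List[str]:
        return list(self.annotations.get(accession, []))


def gather(
    accession: str,
    reference: ReferenceServer,
    servers: List[AnnotationServer],
    use_only: Optional[Set[str]] = None,
) -> Dict[str, Any]:
    """Collect annotations per source; contradictions are deliberately kept."""
    record: Dict[str, Any] = {
        "accession": accession,
        "sequence": reference.get_sequence(accession),
        "annotations": {},
    }
    for server in servers:
        # The end-user, not the system, decides which sources to consult.
        if use_only is not None and server.name not in use_only:
            continue
        notes = server.get_annotations(accession)
        if notes:
            record["annotations"][server.name] = notes
    return record


# Made-up data: two sources disagree about the same protein.
reference = ReferenceServer({"X0001": "MKT"})
lab = AnnotationServer("ms_lab", {"X0001": ["localization: secreted"]})
predictor = AnnotationServer("predictor", {"X0001": ["localization: cytosolic"]})

everything = gather("X0001", reference, [lab, predictor])
experimental_only = gather("X0001", reference, [lab, predictor], use_only={"ms_lab"})
print(everything["annotations"])
print(experimental_only["annotations"])
```

Because the annotations stay keyed by their source, both conflicting statements survive in `everything`, and the choice of which to trust is deferred to the caller through `use_only`.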

9 Bork, P. (2000) Powers and pitfalls in sequence analysis: the 70% hurdle. Genome Res. 10, 398–400
10 Wilkins, M.R. et al. (1996) From proteins to proteomes: large scale protein identification by two-dimensional electrophoresis and amino acid analysis. Biotechnology 14, 61–65
11 Pandey, A. and Mann, M. (2000) Proteomics to study genes and genomes. Nature 405, 837–846
12 Wilm, M. et al. (1996) Femtomole sequencing of proteins from polyacrylamide gels by nanoelectrospray mass spectrometry. Nature 379, 466–469
13 Pandey, A. and Lewitter, F. (1999) Nucleotide sequence databases: a gold mine for biologists. Trends Biochem. Sci. 24, 276–280
14 Mann, M. (1996) A shortcut to interesting human genes: peptide sequence tags, expressed-sequence tags and computers. Trends Biochem. Sci. 21, 494–495
15 Krogh, A. (2000) Using database matches with HMMGene for automated gene detection in Drosophila. Genome Res. 10, 523–528
16 Yates, J.R., 3rd et al. (1997) Direct analysis of protein mixtures by tandem mass spectrometry. J. Protein Chem. 16, 495–497
17 Washburn, M.P. and Yates, J.R. (2000) New methods of proteome analysis: multidimensional chromatography and mass spectrometry. In Proteomics: A Trends Guide (Blackstock, W. and Mann, M., eds), pp. 27–30, Elsevier
18 Gygi, S.P. et al. (2000) Evaluation of two-dimensional gel electrophoresis-based proteome analysis technology. Proc. Natl. Acad. Sci. U. S. A. 97, 9390–9395
19 Martin, S.E. et al. (2000) Subfemtomole MS and MS/MS peptide sequence analysis using nanoHPLC micro-ESI Fourier transform ion cyclotron resonance mass spectrometry. Anal. Chem. 72, 4266–4274
20 Rudd, K.E. et al. (1998) Low molecular weight proteins: a challenge for post-genomic research. Electrophoresis 19, 536–544
21 Tashiro, K. et al. (1993) Signal sequence trap: a cloning strategy for secreted proteins and type I membrane proteins. Science 261, 600–603
22 Velculescu, V.E. et al. (1997) Characterization of the yeast transcriptome. Cell 88, 243–251
23 Kozak, M. (1992) Regulation of translation in eukaryotic systems. Annu. Rev. Cell Biol. 8, 197–225
24 Wold, F. (1981) In vivo chemical modification of proteins (post-translational modification). Annu. Rev. Biochem. 50, 783–814
25 Ben-Bassat, A. et al. (1987) Processing of the initiation methionine from proteins: properties of the Escherichia coli methionine aminopeptidase and its gene structure. J. Bacteriol. 169, 751–757
26 von Heijne, G. (1983) Patterns of amino acids near signal-sequence cleavage sites. Eur. J. Biochem. 133, 17–21
27 Claros, M.G. et al. (1997) Prediction of N-terminal protein sorting signals. Curr. Opin. Struct. Biol. 7, 394–398
28 Nielsen, H. et al. (1997) Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng. 10, 1–6
29 Zolnierowicz, S. and Bollen, M. (2000) Protein phosphorylation and protein phosphatases. De Panne, Belgium, 19–24 September, 1999. EMBO J. 19, 483–488
30 Jensen, O.N. (2000) Modification-specific proteomics: systematic strategies for analysing post-translationally modified proteins. In Proteomics: A Trends Guide (Blackstock, W. and Mann, M., eds), pp. 36–42, Elsevier
31 Gygi, S.P. et al. (1999) Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat. Biotechnol. 17, 994–999
32 Gygi, S.P. et al. (1999) Correlation between protein and mRNA abundance in yeast. Mol. Cell. Biol. 19, 1720–1730
33 Mighell, A.J. et al. (2000) Vertebrate pseudogenes. FEBS Lett. 468, 109–114
34 Nakai, K. (2000) Protein sorting signals and prediction of subcellular localization. Adv. Protein Chem. 54, 277–344
35 Donato, R. (1999) Functional roles of S100 proteins, calcium-binding proteins of the EF-hand type. Biochim. Biophys. Acta 1450, 191–231
36 Madsen, P. et al. (1991) Molecular cloning, occurrence, and expression of a novel partially secreted protein ‘psoriasin’ that is highly up-regulated in psoriatic skin. J. Invest. Dermatol. 97, 701–712
37 Kojima, T. et al. (2000) Human gingival crevicular fluid contains MRP8 (S100A8) and MRP14 (S100A9), two calcium-binding proteins of the S100 family. J. Dent. Res. 79, 740–747
38 Patterson, S.D. et al. (2000) Mass spectrometric identification of proteins released from mitochondria undergoing permeability transition. Cell Death Differ. 7, 137–144
39 Peltier, J.B. et al. (2000) Proteomics of the chloroplast: systematic identification and targeting analysis of lumenal and peripheral thylakoid proteins. Plant Cell 12, 319–341
40 Butler, D. (2000) Music software to come to genome aid? Nature 404, 694
41 Butler, D. and Smaglik, P. (2000) Draft data leave geneticists with a mountain still to climb. Nature 405, 984–985
42 Deutsch, M. and Long, M. (1999) Intron–exon structures of eukaryotic model organisms. Nucleic Acids Res. 27, 3219–3228

Life-or-death decisions by the Bcl-2 protein family

Jerry M. Adams and Suzanne Cory

In response to intracellular damage and certain physiological cues, cells enter the suicide program termed apoptosis, executed by proteases called caspases. Commitment to apoptosis is typically governed by opposing factions of the Bcl-2 family of cytoplasmic proteins. Initiation of the proteolytic cascade requires assembly of certain caspase precursors on a scaffold protein, and the Bcl-2 family determines whether this complex can form. Its pro-survival members can act by sequestering the scaffold protein and/or by preventing the release of apoptogenic molecules from organelles such as mitochondria. Pro-apoptotic family members act as sentinels for cellular damage: cytotoxic signals induce their translocation to the organelles where they bind to their pro-survival relatives, promote organelle damage and trigger apoptosis.

Jerry M. Adams*
Suzanne Cory#
The Walter and Eliza Hall Institute of Medical Research, P O Royal Melbourne Hospital, Melbourne 3050, Australia.
*e-mail: [email protected]
#e-mail: [email protected]

Apoptosis, the stereotypic program of cellular suicide, removes unwanted cells throughout life, and its disrupted regulation is implicated in disorders ranging from cancer and autoimmune diseases to degenerative syndromes. The Bcl-2 family of cytoplasmic proteins plays a central regulatory role. Its interacting pro- and anti-apoptotic members integrate diverse upstream survival and distress signals to determine whether the cellular death warrant is issued. Many of the key players have been identified, and the spotlight is now on the stage where their ‘dance of death’ commences: the surface of organelles such as the mitochondria where Bcl-2 family members either reside or congregate during apoptosis. The hotly debated and still unresolved issue of how their control is exerted is the focus of this and other recent reviews1–7.

The central pathway to death

The Bcl-2 family regulates an ancient path to cell death (Fig. 1), found in organisms as diverse as mammals, nematodes and fruitflies. The route culminates in the scission of critical target proteins by proteases of the caspase group, but only after traversing critical checkpoints. To preclude unscheduled cell suicide, each caspase is synthesized as a minimally active precursor and generation of the active enzyme requires its processing, at sites of caspase cleavage. The effector caspases are processed by ‘upstream’ caspases, but the apical caspase that sets up the execution must process itself. The autocatalysis requires multimerization, aided by an adaptor or scaffold protein such as the nematode protein CED-4, which primes autoactivation of caspase CED-3, or the mammalian CED-4 homolog Apaf-1, which primes autoactivation of procaspase-9 (Refs 4,6,7). The Bcl-2 family determines whether or not the multimeric scaffold/procaspase complex, often termed an ‘apoptosome’, can assemble. Antiapoptotic members such as Bcl-2 and its nematode counterpart CED-9 prevent apoptosome formation, but their life-saving activity can be foiled by other relatives such as Bim or EGL-1 (see below). Caspases can also be controlled downstream of Bcl-2 (Fig. 1). The IAP (inhibitor of apoptosis) proteins appear to directly block caspase activity

http://tibs.trends.com 0968-0004/01/$ – see front matter © 2001 Elsevier Science Ltd. All rights reserved. PII: S0968-0004(00)01740-0