Protein sorting signals and prediction of subcellular localization

KENTA NAKAI
Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan

I. Introduction
II. Sorting of Bacterial Proteins
   A. Overview
   B. Signal Peptides
   C. Topogenesis of Membrane Proteins
   D. Sorting Specific for Gram-Negative Bacteria
   E. Sorting Specific for Gram-Positive Bacteria
   F. Prediction of Localization in Bacterial Cells
III. Sorting of Eukaryotic Proteins
   A. Overview
   B. Signal Peptides and Membrane Proteins
   C. Lipid Anchors
   D. Nucleocytoplasmic Transport
   E. Mitochondrial Targeting Signals
   F. Peroxisomal Targeting Signals
   G. Chloroplast Transit Peptides
   H. Sorting via Transport Vesicles
   I. Endoplasmic Reticulum, Golgi Apparatus, and Secretory Pathway
   J. Lysosome/Vacuole and Endocytic Pathway
   K. Miscellaneous Issues
   L. Prediction of Localization in Eukaryotic Cells
IV. Concluding Remarks
References

ADVANCES IN PROTEIN CHEMISTRY, Vol. 54
Copyright © 2000 by Academic Press. All rights of reproduction in any form reserved. 0065-3233/00 $30.00

I. INTRODUCTION

Recent advances in large-scale sequencing have accelerated the identification of potential genes. To find the function of these genes, homology searching has been routinely used. However, there always remains a significant fraction of genes (open reading frames, ORFs) without any hits in the databases. Further, even a hit in a database search often yields no useful information because so many unannotated sequences are now stored. Predicting the subcellular localization sites of such potential gene products can give some indication of their function because cellular functions are often localized in specific compartments. For example, if a protein is localized at the nucleus, its function is likely to be related to DNA. Even bacteria have several localization sites within the cell. Thus, the prediction of protein subcellular localization is useful, for example, to screen candidate genes for drug discovery. It is also an interesting and challenging problem to automatically annotate the localization information for all hypothetical gene products identified in a genome (Eisenhaber and Bork, 1998).

Because the information determining the subcellular localization site of a protein is, in most cases, encoded in its amino acid sequence, the prediction of subcellular localization sites is of great theoretical interest as an interpretation of genetic information. The localization information is usually represented as a short sequence segment called a protein sorting signal. Some of these signals are represented as well-defined motifs, whereas others show rather vague sequence features that are hard to detect by simple homology searching. Moreover, many of the signals should be interpreted within their context; for example, a simple motif indicative of endocytosis is meaningful only when it is placed in the cytoplasmic tail of a type I membrane protein, which in turn needs several signals to form this type of membrane topology. Therefore, methods for predicting localization sites should be developed based on the wealth of knowledge on protein sorting processes produced by extensive studies in cell biology.

This review summarizes the knowledge on protein sorting signals and the current status of predictive work. It is intended to be a practical guide both for those who want to interpret their own sequence data and for those who want to develop new prediction methods. With this purpose in mind, review articles rather than the original references are often cited for further reading. In addition, information that is rather species- or gene-specific is not included.
The basic story remains unchanged since a review on a similar theme was published in 1991 (Nakai, 1991). Emphasis has been placed on the great advances in our understanding of protein sorting mechanisms.

II. SORTING OF BACTERIAL PROTEINS

A. Overview

Until quite recently, general protein sorting mechanisms had been studied in only a few organisms; most experiments were done on Escherichia coli or on some mammalian secretion systems. However, owing to the accumulation of complete genome sequences of various organisms and the comparative genomic studies based on

them, the situation is changing. In other words, we are beginning to realize what is common in the protein-sorting mechanisms of various organisms. For example, even the smallest known genome, that of Mycoplasma genitalium, which contains only about 470 genes, turned out to have the machinery for lipid modification (Fraser et al., 1995), which appears somewhat exceptional. The effectiveness of comparative genomics even holds for the analyses of archaea, i.e., organisms belonging to the third domain of life. We know that the membrane translocation apparatus of archaea is a mix of eukaryotic and bacterial homologs (Pohlschröder et al., 1997). Like eukaryotic and bacterial proteins, archaeal proteins are exported according to the information in signal peptides. The details of the protein sorting systems of archaea are not discussed here.

Historically, bacteria have been classified into two categories using a staining method developed by C. Gram: gram-positive and gram-negative. The difference in staining patterns reflects the difference in envelope structure between the two categories (Fig. 1). Gram-positive bacteria, such as Bacillus subtilis, have only one membrane, the cytoplasmic membrane, with a surrounding thick cell wall of peptidoglycan and teichoic acids (Navarre and Schneewind, 1999). On the other hand, gram-negative bacteria, such as E. coli, have two membranes: the inner membrane (also called the cytoplasmic membrane), which is similar to the membrane of gram-positive bacteria, and the outer membrane (Duong et al., 1997). The aqueous space between these two membranes is called the periplasm, which also contains a thinner layer of peptidoglycan. Some bacteria also have several other appendages such as fimbriae and pili, which are related to cell adherence, and flagella, which are related to chemotaxis. They are not described here

FIG. 1. Localization sites of (a) gram-positive and (b) gram-negative bacteria.

because their sorting processes seem rather specialized (Aizawa, 1996; Thanassi, 1998). Therefore, gram-positive bacteria have three distinct protein localization sites: the cytosol (the cytoplasm), the cytoplasmic membrane, and the exterior space of the cell, where the proteins are secreted. If we count the cell wall as an independent site, the number becomes four. Similarly, gram-negative bacteria have five localization sites: the cytosol, the inner membrane, the periplasm, the outer membrane, and the exterior space.

At first glance, the sorting process of gram-positive bacterial proteins appears rather simple: a protein will be secreted to the outside if it has a signal peptide at its N terminus; it will be integrated into the cytoplasmic membrane if it has one or more transmembrane segments; otherwise, it will remain within the cytosol. There is also a general sorting pathway for the cell-wall proteins. The sorting process of gram-negative bacterial proteins can be summarized similarly: a protein will pass through the inner membrane if it has a signal peptide but no additional transmembrane segment; if it has transmembrane segments, it will be integrated into the inner membrane. Unfortunately, the sorting mechanisms between the periplasm, the outer membrane, and the outside medium are not yet fully understood, but a signal for outer membrane proteins has been proposed. It can be misleading that the word "secretion" has often been used for mere translocation across the inner membrane in gram-negative bacteria. Sometimes, the word "excretion" is used for the processes in which proteins are moved to the outside medium (although the word "secretion" is preferred for this process in this review).

Both types of bacteria have lipoproteins, i.e., proteins that have a covalently linked lipid moiety.
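The first-glance sorting rules summarized above can be written down as a small decision function. This is only a sketch of the simplified rules as stated in the text: the `Features` record and the function name `predict_site` are hypothetical, the flags are assumed to come from some upstream predictor, and applying the transmembrane check first for gram-positive bacteria as well is an assumption. Cell-wall proteins and lipoproteins are ignored.

```python
from dataclasses import dataclass


@dataclass
class Features:
    """Hypothetical per-protein flags, assumed to come from upstream predictors."""
    has_signal_peptide: bool
    n_transmembrane: int  # transmembrane segments other than the signal peptide


def predict_site(f: Features, gram_negative: bool) -> str:
    """Apply the simplified sorting rules described in the text."""
    # Transmembrane segments take precedence: the protein stays in the
    # cytoplasmic (inner) membrane whether or not it has a signal peptide.
    if f.n_transmembrane > 0:
        return "inner membrane" if gram_negative else "cytoplasmic membrane"
    if f.has_signal_peptide:
        # In gram-negative bacteria the text leaves the final destination open.
        if gram_negative:
            return "periplasm, outer membrane, or exterior (unresolved)"
        return "exterior"
    return "cytosol"


print(predict_site(Features(True, 0), gram_negative=False))   # exterior
print(predict_site(Features(True, 2), gram_negative=True))    # inner membrane
print(predict_site(Features(False, 0), gram_negative=True))   # cytosol
```

The unresolved gram-negative branch is deliberate: the text notes that sorting among the periplasm, the outer membrane, and the outside medium is not yet fully understood.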
Lipoproteins have a slightly different type of signal peptide, and in gram-negative bacteria they are further sorted either to the periplasm or the outer membrane. Some bacterial proteins are localized asymmetrically (e.g., at a restricted portion of the membrane), but this subject is not discussed here (Nelson, 1992; Shapiro, 1993).

B. Signal Peptides

Signal peptides (also called signal sequences or leader sequences) are the amino-terminal extensions of polypeptides that direct them to and across the cytoplasmic membrane in prokaryotes and the endoplasmic reticulum (ER) membrane in eukaryotes. Some people distinguish the terms "signal peptide" and "signal sequence," depending on whether the signal is cleaved. It has long been believed that only one translocation pathway that utilizes signal information exists, namely the SecB-dependent pathway

in prokaryotes and the signal recognition particle (SRP)-dependent pathway in eukaryotes. Surprisingly, recent studies have revealed that there are several pathways for translocation across the cytoplasmic/ER membrane. Moreover, several classes of signal peptides direct the differential selection of these pathways. In addition, signal peptides are recognized by several factors. Therefore, the molecular mechanisms related to signal peptides are rather complex.

1. Sorting Pathways

a. SecB-Dependent Pathway. As mentioned previously, the most well-characterized pathway in both gram-positive and gram-negative bacteria is the SecB-dependent pathway (also called the general secretory pathway, GSP) (Fekkes and Driessen, 1999; Danese and Silhavy, 1998; Ito, 1996). In this pathway, a cytosolic chaperone, SecB, first recognizes a target preprotein (i.e., a protein with a signal peptide). Whether or not SecB specifically recognizes the signal peptide is controversial. Then, this preprotein translocates across the cytoplasmic membrane through an aqueous gated pore with the aid of another helper protein, SecA (Schekman, 1994). The gate is called the translocon, which includes the SecY, SecE, and SecG proteins and possibly others. These processes are posttranslational; in other words, the translocation of a protein is not coupled with its translation. It is suggested that most periplasmic and outer membrane proteins use this pathway.

b. SRP-Dependent Pathway. In the eukaryotic translocation system, a ribonucleoprotein complex, SRP, plays an important role; SRP is also important in bacteria (de Gier et al., 1997; Fekkes and Driessen, 1999). In E. coli, SRP consists of the Ffh protein and the 4.5S RNA. From analyses of eukaryotic systems, it is believed that SRP first interacts with a nascent polypeptide emerging from a ribosome. If SRP recognizes a signal peptide on the polypeptide, it pauses the translation and brings the polypeptide to the SRP receptor (FtsY in E. coli) on the cytoplasmic membrane. The subsequent translocation process in this pathway uses the SecYEG translocon, as does the SecB-dependent pathway. SecA also seems to be involved in the process. In this sense, this pathway can be regarded as a branch of the Sec pathway. However, in this case, the process is believed to be cotranslational; that is, the translation is coupled with the translocation. This pathway translocates mainly the inner membrane proteins in E. coli (Ulbrandt et al., 1997).

c. TAT-Dependent Pathway. As described in Section III,G,3, the translocation machinery of the thylakoid membrane in chloroplasts is evolutionarily related to the Sec-dependent localization system in bacteria. Recently, a third pathway of bacterial protein export was identified, which turned out to be functionally related to the ΔpH-dependent import pathway across the thylakoid membrane in chloroplasts (Santini et al., 1998; Weiner et al., 1998). The pathway is designated the TAT (twin-arginine translocation) pathway because the proteins transported by this pathway have signal peptides with a characteristic pattern of double arginine residues in the amino-terminal region (see Section II,B,2,c) (Berks, 1996). The details of its molecular mechanism have not been clarified, but it is independent of the Sec system, and the components of the system include the products of the tatABCD operon and the tatE gene (Sargent et al., 1998). Homologous gene products also seem to exist in chloroplasts (and possibly in mitochondria) (Bogsch et al., 1998). In bacteria, the TAT-dependent pathway seems to be utilized mostly by a variety of periplasmic redox cofactor-binding proteins, such as proteins binding iron–sulfur clusters, proteins binding the molybdopterin cofactor, and enzymes with polynuclear copper sites (Berks, 1996). The observations that this pathway translocates proteins posttranslationally and that folded/oligomerized proteins seem to pass through the membrane "as is" are consistent with the notion that cofactor-binding proteins are assembled with their cofactors in the cytosol before their translocation (Settles and Martienssen, 1998).

d. Unknown or Targeting Factor-Independent Pathways. The number of translocation pathways in bacteria is not known. Some small proteins may spontaneously translocate across the membrane. For example, a synthetic signal peptide is spontaneously inserted into a model lipid system (Briggs et al., 1985), and a eukaryotic protein, prepromelittin, can be inserted into the ER membrane, at least independently of SRP and its receptor (Muller and Zimmermann, 1987).
Since SecB may bind to the mature part of a preprotein rather than to the signal peptide, it seems possible that proteins without an (amino-terminal) signal peptide can be targeted via the SecB-dependent pathway. An E. coli protein complex, HflKC, which is a heterodimer of HflK and HflC, has signal anchor sequences and translocates across the membrane without SecB and SRP but via the Sec translocase complex (Kihara and Ito, 1998). There is also a report that the mutation of the SecY gene enables E. coli to export proteins that lack signal sequences and may remain unfolded in the cytoplasm (Prinz et al., 1996). There are also several specialized mechanisms for the secretion of proteins into the outside medium (see Section II,D,4).

2. Sequence Features

In recent years, a surprise in the study of signal peptides has been that signal peptides can be recognized by different factors, and that differences in their sequence features determine their preferences for these factors (Zheng and Gierasch, 1996; Siegel, 1995).

a. Classical Features. It has been established that signal peptides have no concrete consensus sequence; rather, a three-region structure is conserved: the n-region, the h-region, and the c-region (Fig. 2) (von Heijne, 1985). The most amino-terminal n-region often contains positively charged residues, i.e., arginine(s) or lysine(s); the central h-region is the hydrophobic core, and its length ranges from about 7 to 15 residues; the carboxy-terminal c-region contains more polar residues than the h-region, and there is a weak consensus pattern specifying the cleavage site. This structure is well conserved across different genes and species. For example, signal peptides of B. subtilis proteins have the same structure, but their n-region tends to be longer than that of E. coli proteins (Simonen and Palva, 1993; Nagarajan, 1993).

b. SRP Dependency. It is now clear that we should classify these signals into several classes corresponding to the previously mentioned pathways (Martoglio and Dobberstein, 1998; Fekkes and Driessen, 1999). The features of each type of signal, including related signals discussed in the following sections, are summarized in Table I. Although it seems evident that the mature domain of preproteins can affect the selection of sorting pathways, it also seems too early to discuss its general nature here. The sequence differences between the SecB-dependent (SRP-independent) signals and the SRP-dependent signals appear rather subtle. However, there is a tendency for the h-region of the SRP-dependent signals to be more hydrophobic in both bacterial and yeast systems (de Gier et al., 1998; Ng et al., 1996).
Increasing the hydrophobicity of an h-region can change the dependency of the translocation system. In addition, it is likely that the net charge of the n-region also affects the selectivity of the two pathways because the net charge affects the translocation efficiency, and the degree of hydrophobicity of the h-region can compensate for it, more or less. The conformation of signal peptides also seems to be important. It has been postulated that a kink within the h-region may facilitate the translocation of signal peptides (Matoba and Ogrydziak, 1998).

FIG. 2. Tripartite structure of a signal peptide.

TABLE I
Types of Signal Peptides

Signal | Features
SRP-dependent | 18–26 amino acids (aa) in length; mostly positive n-region, hydrophobic h-region, and c-region harboring (−3, −1) consensus for cleavage; majority in higher eukaryotes
SRP-independent/SecB-dependent | Similar to SRP-dependent, but length and/or hydrophobicity of h-region is smaller; also used at endoplasmic reticulum (ER)
TAT-dependent | Longer in length (26–58 aa); "twin-arginine" motif in n-region; also "Sec-avoidance" lysine in c-region; not found at ER but found at chloroplasts
SPase II-dependent | Type II signal sequence; used for lipoproteins; "LA(G/A)|C" motif for cleavage in c-region
Signal Anchor I | Forms opposite Nexo/Ccyt orientation; few or no charges in n-region; longer h-region is favored than in type II anchor
Signal Anchor II | Ncyt/Cexo orientation like ordinary signal peptides; no (−3, −1) motif or longer h-region than ordinary signal peptides (but shorter than that of type I anchor); positively charged n-region

c. "Twin-Arginine" Motif. The signal peptides directing the TAT-dependent pathway have some characteristic features (Berks, 1996). In general, these signals are long, 26 to 58 residues, whereas the typical range of the "Sec"-type signals is between 18 and 26 residues (Fekkes and Driessen, 1999). This difference is due mostly to the extension of the n-region and partly to the h-region but, as an overall tendency, the TAT-targeting signal is less hydrophobic than the Sec-targeting signal (Cristóbal et al., 1999). Moreover, the peptides possess a "twin-arginine" motif immediately upstream of the h-region. The consensus sequence of this motif is "(S/T)RRXΦΦ," where "X" represents an arbitrary residue and "Φ" represents a hydrophobic residue. Even the change of an "R" (arginine) into "K" (lysine) can destroy the signal, and the presence of the two hydrophobic residues is also essential, at least in chloroplasts (Chaddock et al., 1995; Brink et al., 1998).
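The twin-arginine consensus "(S/T)RRXΦΦ" lends itself to a simple pattern search. The following is a minimal sketch rather than a validated predictor: the set of residues counted as hydrophobic (Φ), the 40-residue search window, and the example sequences are all assumptions made for illustration. As the text notes, the motif alone may not be sufficient to make a signal TAT-dependent.

```python
import re

# Assumption: which residues count as "hydrophobic" (Phi) is a judgment call.
HYDROPHOBIC = "AVLIMFWC"

# (S/T) R R X Phi Phi, written as a regular expression.
TWIN_ARG = re.compile(rf"[ST]RR.[{HYDROPHOBIC}]{{2}}")


def find_twin_arginine(seq: str, n_terminal_window: int = 40) -> list[int]:
    """Return 0-based start positions of the motif near the N terminus.

    The window size is a hypothetical choice; TAT signals are reported to be
    26 to 58 residues long, with the motif just upstream of the h-region.
    """
    head = seq[:n_terminal_window].upper()
    return [m.start() for m in TWIN_ARG.finditer(head)]


# Illustrative (made-up) N-terminal sequences.
with_motif = "MNNSDLFQASRRRFLAQLGG"
lysine_variant = "MNNSDLFQASKKRFLAQLGG"  # R-to-K change removes the match

print(find_twin_arginine(with_motif))      # [9]
print(find_twin_arginine(lysine_variant))  # []
```

A hit from such a pattern is best treated as one feature among several (signal length, h-region hydrophobicity, c-region lysine) rather than as a decision on its own.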

The addition of a "twin-arginine" motif may not be sufficient to convert usual signal peptides into "TAT-dependent" ones. There is a type of mature protein that is "Sec-incompatible." Furthermore, there seems to be a "Sec-avoidance" motif, which is a lysine residue in the c-region. Combined incorporation of the twin-arginine and Sec-avoidance signals seems to convert the Sec-dependent signals into TAT-dependent ones in chloroplasts (Bogsch et al., 1997). Most bacterial proteins that use the TAT pathway bind cofactors, which may interact with the translocation machinery. However, other types of proteins also seem to use this pathway. Thus, the Sec-dependent and TAT-dependent pathways compete with each other. Increasing the hydrophobicity of a TAT-directing signal can convert it into a Sec-directing signal (Cristóbal et al., 1999). As described in Section II,B,1, this pathway seems to allow folded and oligomerized proteins to be translocated across the membrane. Therefore, it is not always necessary for a subunit protein to have the signal if another subunit has one (the so-called hitchhiker or piggy-backing mechanism) (Rodrigue et al., 1999). In this case, the prediction of its localization site would be inherently difficult.

3. Specificity of Signal Peptidase

In many cases, signal peptides of preproteins are cleaved off after their translocation. Otherwise, the hydrophobic segment of the signal peptide remains inserted across the membrane, anchoring the mature part. In this case, the signal is called a "signal anchor" (Martoglio and Dobberstein, 1998). There are two kinds of signal anchors based on their orientation (see Section II,C,1). The cleavage of signal peptides is performed by membrane-bound enzymes called signal peptidases (Pugsley, 1993; Dalbey et al., 1997). Two types of signal peptidases are known. The class I signal peptidase (also called signal peptidase I) cleaves ordinary signal peptides. The peptidases at the mitochondrial inner membrane and the thylakoid membrane of chloroplasts, as well as the signal peptidase on the ER of eukaryotic cells, also belong to this class. The class II signal peptidase specifically cleaves the signal peptides of lipoproteins after the cysteine residue on the signal peptide is modified with fatty acids. The substrate specificity of the class I signal peptidases is known as the (−3, −1) rule observed at the c-region of signal peptides (von Heijne, 1984; Jain et al., 1994), where the residues at positions −3 and −1 from the cleavage site (i.e., cleavage occurs at the peptide bond between the −1/+1 positions) are usually small (and neutral) residues, such as alanine. Recently, the x-ray crystallographic structure of the signal

peptidase I bound to an inhibitor was determined (Paetzel et al., 1998). The structure explains the requirement of the (−3, −1) rule. In addition, it shows that the c-region must be in an extended conformation. This observation has an important implication for the nature of cleavage (von Heijne, 1998): namely, the signal peptidase cleaves neither the transmembrane segments of membrane-integrated proteins nor artificial signal peptides with an extended h-region (Nilsson et al., 1994). Since the transmembrane segments are usually more hydrophobic than h-regions (as described in Section III,B,1), they may fail to locate the potential cleavage site at an appropriate spatial position. Other factors such as the length of the n-region can influence the cleavage, and so it is difficult to predict the cleavage event from the mere presence of the (−3, −1) pattern. The features of signal-anchor sequences are discussed later in the context of membrane protein topogenesis (see Section II,C,2).

The signal sequences of lipoproteins are often called "type II signal sequences" because they are cleaved by class II signal peptidases (signal peptidases II). They harbor a somewhat different consensus pattern in the c-region; most important, the residue at the +1 position must be cysteine, which is modified with fatty acids, and the residue at the −3 position tends to be a large hydrophobic residue, such as leucine. A typical consensus is represented as "LA(G/A)|C," where "|" is the cleavage site (Klein et al., 1988; von Heijne, 1989). Processing of lipoproteins by the signal peptidase II also exists in gram-positive bacteria. According to a recent report, signal peptidase II in B. subtilis processes the signal peptide of α-amylase, which is a nonlipoprotein (Tjalsma et al., 1999). Thus, the specificity of signal peptidase II may not be confined to lipoproteins.

4. Prediction Methods

Prediction of the presence or the absence of a signal peptide in a given amino acid sequence may not always be a well-defined problem. For example, a fraction of plasminogen activator inhibitor-2 exists in the cytosol despite the presence of an amino-terminal signal peptide (Belin et al., 1996). It seems that each signal peptide shows its own degree of efficiency. However, the detection of signal peptides is useful, and many prediction methods have been developed (Claros et al., 1997). Most of them can also predict the cleavage site of the signal peptide. Because the basic structure of signal peptides is common between bacteria and eukaryotes, all prediction methods can be applied to each category of data, although the optimized numeric parameters differ. Certainly, a method with high accuracy would be desirable for practical uses. However, it is difficult to compare the performances of various methods from the literature simply because each method uses different training data and a different evaluation method (Nielsen et al., 1999). In principle, newer methods are favorable because they likely use newer training data, but the description of classical methods is also useful for understanding basic concepts. Therefore, this section introduces two new publicly available methods, SignalP (Nielsen et al., 1997) and SignalP-HMM (Nielsen and Krogh, 1998), as extensions of two classic methods, McGeoch's method (McGeoch, 1985) and von Heijne's method (von Heijne, 1986b). A brief introduction to other methods (Folz and Gordon, 1987; Ladunga et al., 1991; Arrigo et al., 1991; Schneider and Wrede, 1993) can be found in a review by Claros et al. (1997).

a. Window-Search Methods. One of the most basic methods for detecting a certain sequence pattern is to construct a weight matrix (also called a position-specific score matrix) of the pattern, scan given sequences with it, and pick high-scoring positions (Durbin et al., 1998). This method uses a window of a fixed length and tries to find sequence positions that fit well to the model specified by it. The weight matrices constructed from the compilation of known signal peptides by von Heijne are typical examples of such an approach (von Heijne, 1986b). Two kinds of matrices were constructed, one from the data of bacteria and one from the data of eukaryotes. Because the length of signal peptides varies, the reference position was taken at the cleavage site; positions from −13 to +2 were taken, and the contributions from positions further upstream were neglected. Therefore, it is likely that the matrices mostly contain information from the h-region and the c-region. The (−3, −1) rule plays an especially important role in them.
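The weight-matrix idea described above can be illustrated with a toy example. This is a sketch of the general technique, not von Heijne's actual matrices: the training windows are made up, the window is shortened to five positions (−3 to +2) instead of the −13 to +2 used in the original work, a uniform background frequency is assumed, and the pseudocount is an arbitrary choice.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_weight_matrix(aligned, pseudocount=1.0):
    """Build a log-odds matrix from equal-length windows aligned at the cleavage site."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for pos in range(len(aligned[0])):
        counts = Counter(window[pos] for window in aligned)
        matrix.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return matrix


def scan(seq, matrix):
    """Slide the matrix along seq; return (best_score, start_of_best_window)."""
    width = len(matrix)
    scored = [
        (sum(matrix[i][seq[start + i]] for i in range(width)), start)
        for start in range(len(seq) - width + 1)
    ]
    return max(scored)


# Hypothetical training windows covering positions -3, -2, -1, +1, +2.
# Small residues (here alanine) at -3 and -1 mimic the (-3, -1) rule.
training = ["ALAAD", "AQAAE", "ASAKD", "AHAQT"]
matrix = build_weight_matrix(training)

query = "MKKTLLGLLVFSAQAADEP"  # made-up sequence
score, start = scan(query, matrix)
# The window holds positions -3..+2, so the mature protein would begin
# three residues after the window start.
print(start, start + 3, round(score, 2))
```

Real matrices are estimated from hundreds of verified signal peptides, and a score threshold is needed to decide whether any window is credible at all; the sketch only ranks windows within one sequence.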
The method was originally proposed for the detection of cleavage sites, but it is also useful for detecting the presence of signal peptides because there are few signal peptides without any cleavage sites (i.e., signal anchors). More recently, a method based on a much more sophisticated technique and larger training data was proposed (Nielsen et al., 1997), but it still uses a window of fixed length. The technique involves a neural network, which iteratively adjusts many numeric parameters to best discriminate between two sets of data (Baldi and Brunak, 1998). The authors created three kinds of predictors: one for gram-positive bacteria, one for gram-negative bacteria, and one for eukaryotes. In each predictor, two kinds of neural networks were used; one network calculates a score (the S score) that represents the tendency of a given segment (of length 19 and 27 for bacteria and eukaryotes, respectively) to be part of a signal peptide. The other network calculates another score

(the C score) that represents the tendency of a given segment (of length ranging from 13 to 23) to be a cleavage site. The final score (the Y score) is calculated as a geometric average of the C score and a numeric derivative of the S score. The updated version of this method gives a prediction accuracy of 72.4%, 83.4%, and 67.5% for eukaryotes, gram-negative bacteria, and gram-positive bacteria, respectively (Nielsen et al., 1999), in locating the cleavage site. However, one should not expect these percentages to hold when the method is applied to an unknown proteome. One reason is that the compositions of both positive and negative data are different. Another is that it is generally difficult to automatically predict the start codon of potential gene products. Nevertheless, this SignalP method seems to be the most reliable method currently available and is widely used through the Internet.

b. Recognition of Tripartite Structure. Another approach for the detection of signal peptides is to detect the tripartite (three-domain) structure of the signal. A classic method taking this approach was presented by McGeoch (McGeoch, 1985). In his algorithm, the boundary between the n-region and the h-region is searched for within the amino-terminal 12-residue segment. Then the h-region is defined as the subsequent uncharged region. Lastly, the length of the h-region and the degree of hydrophobicity of the 8-residue maximal hydrophobic region are combined to detect the presence of signal peptides. This method was later included in the global prediction system called PSORT (see Sections II,F,2 and III,L,2) using discriminant analysis (Nakai and Kanehisa, 1991; Nakai and Kanehisa, 1992). In this implementation, another variable, the net charge of the n-region, was also added. At that time, McGeoch's method showed better predictability than that of von Heijne for predicting the existence of signal peptides, although McGeoch's method cannot predict the cleavage site.
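The feature extraction behind the tripartite approach can be sketched as follows. This is a loose illustration of the steps described above, not McGeoch's or PSORT's actual implementation: placing the n/h boundary after the last charged residue in the first 12 positions is an assumption, the Kyte-Doolittle hydropathy scale is swapped in for whatever scale the original used, and no decision threshold is given because the published discriminant parameters are not reproduced here. The function name and the example sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values (an assumed choice of scale).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}
CHARGED = set("DEKR")


def tripartite_features(seq, search_len=12, window=8):
    """Extract n-region net charge, h-region length, and peak hydrophobicity."""
    seq = seq.upper()
    # Assumption: the n/h boundary follows the last charged residue found
    # within the first `search_len` residues.
    charged = [i for i, aa in enumerate(seq[:search_len]) if aa in CHARGED]
    h_start = charged[-1] + 1 if charged else 0
    # The h-region is the following run of uncharged residues.
    h_end = h_start
    while h_end < len(seq) and seq[h_end] not in CHARGED:
        h_end += 1
    n_region, h_region = seq[:h_start], seq[h_start:h_end]

    net_charge = sum(aa in "KR" for aa in n_region) - sum(aa in "DE" for aa in n_region)

    # Mean hydropathy of the most hydrophobic `window`-residue stretch.
    values = [KD.get(aa, 0.0) for aa in h_region]
    if len(values) >= window:
        peak = max(sum(values[i:i + window]) / window
                   for i in range(len(values) - window + 1))
    elif values:
        peak = sum(values) / len(values)
    else:
        peak = 0.0
    return {"n_net_charge": net_charge, "h_length": len(h_region), "h_peak": peak}


# Illustrative sequence resembling a bacterial signal peptide (made up).
print(tripartite_features("MKKTAIAIAVALAGFATVAQAAPKD"))
```

Note that the uncharged run found this way also swallows the c-region, which is one reason such a method cannot locate the cleavage site. In a PSORT-like setting these three numbers would be fed into a discriminant function fitted on known examples.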
In PSORT, the methods were combined to detect the signal anchors. The resulting parameters were later optimized to the data of B. subtilis and Saccharomyces cerevisiae (Nakai, 1996). Recently, the hidden Markov model (HMM) was applied to the prediction of signal peptides (Nielsen and Krogh, 1998). HMM is a probabilistic technique that can be suited to model various aspects of sequence data (Durbin et al., 1998). It has been successfully used in the field of speech recognition and in molecular biology for gene-finding and motif representation. Using an HMM, the tripartite structure of signal peptides can be modeled naturally, and a general algorithm can be used to scan given sequences with the obtained model. Moreover, in this program (designated SignalP-HMM), the distinction between the cleavable signal

peptide and the (type II) signal anchor is also possible by assuming that a sequence segment can be classified into one of three states: a cleavable signal peptide, a signal anchor, or some other ordinary sequence. Although the prediction accuracies for detecting signal peptides are comparable between SignalP and SignalP-HMM, SignalP-HMM is inferior to SignalP in the accuracy of cleavage-site prediction in an objective test (Nielsen et al., 1999).

C. Topogenesis of Membrane Proteins

A significant fraction of proteins in the cell is integrated into the membrane. Identifying whether a protein is a membrane protein is useful to deduce its function. Additional structural information on which part(s) of its sequence is inserted across the membrane, as well as information on the orientation, i.e., on which side of the membrane its N terminus resides, is also desirable. Such an orientation is called the membrane topology. In principle, some sorting signals must protrude from a specific side of the membrane; otherwise their subcellular receptor molecules cannot access them. Therefore, the prediction of membrane topology is a prerequisite for the prediction of the subcellular localization site. Other types of membrane-associated proteins are not entirely integrated but are located near the membrane. Such proteins are called peripheral membrane proteins. Although such information would also be useful, their discrimination from soluble proteins is difficult, and the term “membrane proteins” is used here to specify only the membrane-integrated type. A distinct class of peripheral membrane proteins, proteins anchored by their lipid moiety, is discussed later. Almost all membrane proteins have their own topology that is uniquely determined by their amino acid sequence. However, some exceptional proteins have dual orientations (Dunlop et al., 1995).
In another protein, the orientation is altered in vivo for functional reasons (Bruss et al., 1994; Prange and Streeck, 1995). The molecular mechanisms of the topogenesis of membrane proteins are not fully understood, but some important findings are described next.

1. Folding Types and Topology

a. Folding Type. So far, the majority of membrane proteins with known three-dimensional structures belong to a single class; all of their transmembrane (also called membrane-spanning) segments are α helices composed of apolar residues. Thus, most of the studies on the biogenesis of membrane proteins have been on this class of proteins. In another class of membrane proteins, all the transmembrane segments are composed of β strands (Cowan and Rosenbusch, 1994; von Heijne, 1995). This type of protein has been discovered in a limited number of membranes, such as the outer membrane of gram-negative bacteria. Their sorting mechanism and prediction are discussed in Section II,D,2. It is not known how many structural classes of membrane proteins exist. A new type of structure has been found in the acetylcholine receptor (Hucho et al., 1994; Miyazawa et al., 1999). This section focuses on the topogenesis of all-α types, which accounts for most current knowledge.

b. Classification of Topology. The topology of membrane proteins has been classified in various ways. The classifications are sometimes confusing because the same or similar names can represent different types. One consensus for a protein with a single membrane-spanning domain (a bitopic protein) is that it is called type I if its N terminus is located on the extracytoplasmic side (Nexo/Ccyt) and type II if its N terminus is located on the cytoplasmic side (Ncyt/Cexo). This discussion uses the definition introduced by Spiess (1995), which is an extension of the definition by von Heijne and Gavel (1988) (Fig. 3). In this definition, type I proteins

FIG. 3. Classification of single-spanning membrane proteins based on topology. (a) The “loop model” for explaining the biogenesis of type I topology in the translocon. The stop-transfer signal stops the integration. (b) Type I protein and a cleaved signal peptide. (c) Type II (NcytCexo) is made by a type II signal anchor. (d) Type III (NexoCcyt; often called type I) is made by a type I signal anchor. (e) Type IV (C-tail) is made independently of the translocon.

have a narrower meaning; namely, they have a cleavable amino-terminal signal peptide and one additional transmembrane segment. After cleavage, the N terminus of the mature part is located on the extracytoplasmic side. Conversely, proteins that have an uncleavable signal peptide and take the NexoCcyt topology are called type III. This type of signal peptide is also called a type I signal anchor, and the type II signal anchor takes the protein to the NcytCexo configuration, which is called the type II topology (consistent with the preceding consensus; see Table I). The type IV membrane protein forms an unusual class that has neither a signal peptide nor a signal anchor but has one transmembrane segment near the C terminus (Kutay et al., 1993). Note that this classification does not cover all the theoretical possibilities. For example, a cleavable signal peptide cannot direct the creation of a topology opposite to type I (according to the loop model). Why other types are not observed must be explained from their insertion mechanisms. This classification can be naturally extended to multispanning proteins (polytopic proteins) based on the location of their N terminus (therefore, type IV cannot be defined). For the record, according to the classification of Singer (1990), type I proteins are called type Ia, and type III proteins are called type Ib. In Singer’s definition, type III proteins represent multispanning proteins and type IV represents a water-filled channel (defined for porins). Furthermore, according to Howell and Crine (1996), type IV represents multimers of subunits, type V represents proteins that are anchored to the membrane by a covalently linked lipid moiety only, and type VI represents those anchored both by a transmembrane domain and the glycosylphosphatidylinositol (GPI) anchor (see Section III,C,3).

2. Mechanical Issues

The details of the molecular mechanisms of membrane protein biosynthesis have not been fully clarified.
Related to the different processing pathways of signal peptides, there seem to be multiple topogenic pathways. However, although some clear differences are reported (Gafvelin et al., 1997), the basic mechanisms of prokaryotic and eukaryotic systems do not differ as much as previously expected. The following describes a rather simplified view common to prokaryotic and eukaryotic systems. Some excellent reviews on this theme have already been published (Sakaguchi, 1997; von Heijne, 1997; Hegde and Lingappa, 1997; Matlack et al., 1998; Bernstein, 1998).

a. The Players and the Stage. Like secreted proteins, membrane proteins also use the translocon gate for their integration (the Sec61p

complex in eukaryotes and the SecYEG complex in prokaryotes). Like eukaryotic membrane proteins, most multispanning inner membrane proteins are likely to use the SRP-dependent system (Section II,B,1) (Ulbrandt et al., 1997; de Gier et al., 1998). Because many membrane proteins do not have cleavable signal peptides, it is likely that their (most amino-terminal) internal transmembrane segment is recognized as a signal anchor by SRP in such cases. When it is a type I signal anchor, the most amino-terminal segment preceding the signal anchor is translocated across the membrane. In this case, the segment is called the N-tail (Dalbey et al., 1995). In prokaryotic systems, the translocation mechanism of N-tail proteins has not been well characterized, but the Sec-translocation system seems to be used (McMurry and Kendall, 1999). Longer N-tails tend to be avoided in prokaryotes. The SRP-dependent pathway of protein translocation is thought to occur cotranslationally; i.e., the nascent polypeptide emerging from the ribosome is inserted into the translocon using this translation process as its driving force. Moreover, there is a growing list of evidence that the ribosome plays an important role in this process (Siegel, 1997; Bibi, 1998). It may even recognize the transmembrane segments and might be considered as a subunit of the translocon complex. Other proteins also participate in the integration process. One class is composed of molecular chaperones such as SecA in bacteria and Hsp70 or BiP in eukaryotes (Qi and Bernstein, 1999; Schekman, 1994; Mothes et al., 1997; Hamman et al., 1998; Pilon and Schekman, 1999). Another important player in the eukaryotic system is TRAM (translocating chain-associating membrane protein) (Walter, 1992). b. Charge Effects. 
From statistical analyses of amino acid sequences of bacterial inner membrane proteins, the importance of positively charged residues located on loops had been proposed (von Heijne, 1986a; von Heijne and Gavel, 1988) and was subsequently shown experimentally (Boyd and Beckwith, 1990; Andersson et al., 1992). More specifically, the segments facing the cytoplasm contain more arginines and lysines than the segments facing the periplasm (the ‘‘positive-inside’’ rule) (von Heijne, 1994). Changing the amount of positively charged residues can make the overall topology reverse or can even leave out one of the transmembrane segments from the membrane (Gafvelin and von Heijne, 1994). Similar but slightly different effects are also observed on eukaryotic membrane proteins (Gafvelin et al., 1997); notably, the effects of positive charges on internal loops are weaker in the eukaryotic system. Not only positively charged residues but also negatively charged residues may affect the topology, especially in eukaryotes (Wahlberg and Spiess,

1997; Kiefer et al., 1997). Moreover, the amphiphilic nature of charge distribution, where charged residues tend to locate at one side of a helix while hydrophobic residues tend to locate at the other side, may affect the topogenesis (Seligman and Manoil, 1994). Note that the positive-inside rule is consistent with the observation that most signal peptides have a few positively charged residues in the n-region because the N terminus of a signal peptide faces toward the cytoplasmic side of the membrane. The opposite functions of type I and type II signal anchors can be explained from this rule to some extent; generally speaking, type I signals tend to have fewer positive charges in the n-region and a longer h-region, whereas type II signals tend to have more positive charges in the n-region and a shorter h-region (Beltzer et al., 1991; Sakaguchi et al., 1992; Wahlberg and Spiess, 1997). Alternatively, the net charge balance between the regions flanking an h-region on both sides may be important (see later). The molecular basis of the positive-inside rule is unknown. However, negatively charged phospholipids can control the membrane protein topology, which suggests part of the mechanism (van Klompenburg et al., 1997).

c. Another Possible Mechanism. Another element that may affect the biogenesis of membrane proteins is the STE (stop transfer effector), which was discovered from studies on prion (Yost et al., 1990) and apolipoprotein B (Chuck and Lingappa, 1992). STE is a short stretch of basic and hydroxylated residues that causes a “pause transfer” of the translocation (Chuck and Lingappa, 1993; Nakahara et al., 1994). However, its general role in membrane protein assembly is still unknown.

d. Models of Membrane Integration. The positive-inside rule seems to explain the topology of membrane proteins in most cases when they have a single spanning segment. However, it does not always explain the topology of multispanning proteins.
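As a minimal sketch of how the positive-inside rule can be applied, the following function counts arginines and lysines in the loops on each side of the membrane, given transmembrane segments that are assumed to be known already. The function names, the half-open index convention, and the labels "N-in" (N terminus cytoplasmic) and "N-out" are assumptions of this illustration; published methods typically add refinements, such as restricting the count to short loops, that are omitted here.

```python
def positive_count(segment):
    """Number of arginine and lysine residues in a segment."""
    return segment.count("K") + segment.count("R")


def orient_by_positive_inside(seq, tm_segments):
    """Guess which side the N terminus faces from the K/R content of the loops.

    tm_segments: sorted, non-overlapping (start, end) half-open index pairs
    marking the transmembrane segments (assumed to be given).
    """
    seq = seq.upper()
    loops, previous_end = [], 0
    for start, end in tm_segments:
        loops.append(seq[previous_end:start])
        previous_end = end
    loops.append(seq[previous_end:])
    # Loops alternate sides: even-numbered loops share the N-terminal side.
    n_side = sum(positive_count(loop) for loop in loops[0::2])
    other_side = sum(positive_count(loop) for loop in loops[1::2])
    if n_side > other_side:
        return "N-in"    # the more positive side is taken to be cytoplasmic
    if other_side > n_side:
        return "N-out"
    return "undetermined"
```

Because the sketch fixes the segments in advance, it cannot reproduce the cases discussed in the text where changing the charges removes a segment from the membrane altogether.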
Whether a membrane protein is integrated into the membrane cotranslationally or posttranslationally is thought to be important. If the integration is cotranslational and if all membrane-spanning regions are well defined (i.e., sufficiently hydrophobic), the sole determination of the orientation of the most amino-terminal segment will automatically determine the total topology. Because eukaryotic membrane proteins seem to be SRP-dependent, they are cotranslationally integrated. Thus, the charge balance between the flanking regions on both sides of the

most amino-terminal transmembrane segment was believed to be important (Hartmann et al., 1989). According to a simple cotranslational model, transmembrane segments can have two different kinds of function: the start-transfer signal and the stop-transfer signal (Kuroiwa et al., 1991; Kuroiwa et al., 1996). Start-transfer signals are the internal type II signal anchors that direct the translocation of subsequent loops, whereas stop-transfer signals stop the translocation by anchoring themselves in the membrane. It was postulated that a transmembrane segment next to a start-transfer signal works as a (type II) stop-transfer signal and vice versa. Thus, a multispanning protein is “stitched” into the membrane one segment after another. On the other hand, in the case of bacterial proteins, the global balance of net positive charges located on the loops between both sides of the membrane has been emphasized (von Heijne, 1992). In the model for eukaryotic proteins, however, it is hard to explain the existence of rather hydrophilic transmembrane segments. Moreover, experimental evidence that contradicts the sequential model is accumulating (Gafvelin et al., 1997; Ota et al., 1998a; Ota et al., 1998b). Based on these results, a new model for eukaryotes, which is more similar to the previous prokaryotic model, has been proposed (Ota et al., 1998b). According to the new model, a membrane-spanning segment can work not only as a type II signal anchor but also as a type I signal anchor. In the latter case, a preceding weakly hydrophobic (cryptic) segment can be integrated into the membrane. Unlike in the sequential model, in this model the character of a segment as a signal is not determined solely by that of its upstream segment; in a sequential model, a segment next to a stop-transfer signal is thought to act as a signal anchor.
Rather, each segment is likely to have its degree of signal character that is more or less determined by its local conditions, such as its hydrophobicity and flanking charged residues, but the characters of neighboring segments are not totally independent. This model appears more realistic than the previous model, and it will be interesting to construct a prediction method based on such a model. 3. Prediction Methods Objective assessments of currently available prediction methods are rather difficult because there are relatively few membrane proteins with known membrane-spanning segments (especially including their boundary information) and topology. Several tests have been attempted on newly determined structures, but they do not give an average performance for each prediction method (Turner and Weiner, 1993). Only

a very brief description of the typical prediction methods is presented here.

a. Prediction of Transmembrane Segments. Transmembrane segments are usually characterized as apolar and largely hydrophobic segments. The success of the hydropathy plot by Kyte and Doolittle (1982) proved the validity of this view, at least to some extent. A consensus pattern of the transmembrane segments of type I proteins was proposed, but its general validity is unclear (Landolt-Marticorena et al., 1993). A practical way to improve the prediction accuracy is to use the multiple alignment of a family of sequences, if available (Persson and Argos, 1994). A similar approach was taken using the more complicated technique of a profile-based neural network (Rost et al., 1995). This method claimed an accuracy of 95%, and it has been improved further by refining the output of the network with global information (Rost et al., 1996). Incorporation of global aspects also enables the program to predict the overall topology. Another simple prediction method using a specialized dot plot to detect weak similarities between nonhomologous membrane proteins has been proposed (Cserzö et al., 1997). The strength of this method is its simplicity; neither the information of a protein family nor the positive-inside rule is used. Such a method is also expected to be useful for analyzing “unusual” proteins. Recently, another prediction method, in which the termini of potential transmembrane segments are precisely calculated, was proposed (Pasquier et al., 1999).

b. Prediction of Topology. A classic method for predicting the topology of eukaryotic membrane proteins was proposed by Hartmann et al. (1989). This method assumes the stitching model and only evaluates the charge difference between the two sides of the most amino-terminal transmembrane helix, which must be predicted beforehand. Another classic method, for predicting the topology of prokaryotic proteins, is that of von Heijne (1992).
It extensively uses the idea of the positive-inside rule. In this method, the transmembrane segments and the topology are predicted simultaneously, in the sense that less-hydrophobic segments are predicted to be membrane spanning only when the total positive charge balance across the membrane of the model is improved. Later, a similar approach was used to predict the topology of eukaryotic proteins by incorporating information on the amino acid composition of long loops (Sipos and von Heijne, 1993), but its performance was not as good as that of the prokaryotic system. The algorithms were implemented in the TopPred II program, which is freely distributed.
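The idea of accepting a less-hydrophobic segment only when it improves the charge balance can be sketched as an exhaustive search over candidate models. This is an illustration of the principle, not the TopPred II algorithm: the split into "certain" and "putative" segments is assumed to come from an earlier hydrophobicity analysis, and the function names, the tie-breaking rule, and the scoring by raw K+R counts are assumptions of the sketch.

```python
from itertools import combinations


def charge_bias(seq, segments):
    """K+R count on the N-terminal side minus the count on the opposite side."""
    loops, previous_end = [], 0
    for start, end in segments:
        loops.append(seq[previous_end:start])
        previous_end = end
    loops.append(seq[previous_end:])
    count = lambda loop: loop.count("K") + loop.count("R")
    return sum(map(count, loops[0::2])) - sum(map(count, loops[1::2]))


def best_topology(seq, certain, putative):
    """Pick the putative segments whose inclusion gives the largest charge bias.

    certain, putative: lists of (start, end) half-open index pairs.
    """
    seq = seq.upper()
    best = None
    for r in range(len(putative) + 1):
        for extra in combinations(putative, r):
            segments = sorted(certain + list(extra))
            bias = charge_bias(seq, segments)
            # Prefer a larger bias; on ties prefer fewer putative segments,
            # so a putative segment is kept only if it improves the balance.
            key = (abs(bias), -r)
            if best is None or key > best[0]:
                best = (key, segments, bias)
    _, segments, bias = best
    orientation = "N-in" if bias > 0 else "N-out" if bias < 0 else "undetermined"
    return segments, orientation, bias
```

The exhaustive enumeration grows exponentially with the number of putative segments, which is acceptable for a handful of candidates; the dynamic programming and HMM approaches mentioned in this section address the same search more efficiently.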

The idea of finding the best model was extended by Jones et al. (1994). A dynamic programming algorithm was used to select the most plausible model, and the same authors also presented an ambitious method to predict the three-dimensional structure of the α-helical membrane proteins (Taylor et al., 1994). Finally, HMMs were used to model the overall structure of the membrane topology by two groups of researchers (Sonnhammer et al., 1998; Tusnády and Simon, 1998). Although different techniques are used, all of these methods, except for that of Hartmann et al., search for the most plausible model. Thus, the definition of a reasonable target function seems to be the key issue.

D. Sorting Specific for Gram-Negative Bacteria

The biogenesis of a gram-negative bacterial envelope requires subsequent sorting mechanisms for its component proteins after their translocation across the inner membrane (Duong et al., 1997; Danese and Silhavy, 1998). In addition, some proteins such as proteases and toxins are secreted through the outer membrane to the extracellular space. Three major secretion pathways have been characterized (Salmond and Reeves, 1993).

1. Periplasmic versus Outer Membrane Proteins

Whether or not they are lipoproteins, both periplasmic proteins and outer membrane proteins translocate across the inner membrane; thus there should be some cellular mechanisms that sort them. Unlike inner membrane proteins, outer membrane proteins do not have characteristic hydrophobic transmembrane segments; as such, most, if not all, of them are thought to be composed of β strands. Moreover, it has been suggested that such conformation may be the determinant of the integration into the outer membrane; in other words, these proteins may be spontaneously integrated into the outer membrane. If this assumption is correct, the outer membrane proteins must fold in the periplasm.
Another possibility is that the outer membrane proteins are integrated at certain sites where the inner and outer membranes are in contact. This issue has not been resolved, but a recent experiment supports periplasmic folding (Eppens et al., 1997). It has been reported that the signal for outer membrane assembly resides at the C termini of the proteins (Struyvé et al., 1991; de Cock et al., 1997). That is, the most C-terminal residue of outer membrane proteins must be phenylalanine. Experiments using mutant proteins showed that the C-terminal phenylalanine is important for the efficient assembly of PhoE.

Recently, a periplasmic protein, Skp, which selectively binds to a set of nonnative outer membrane proteins, was discovered (Chen and Henning, 1996; de Cock et al., 1999). Skp is proposed to act as a chaperone that helps outer membrane proteins to be efficiently targeted to the outer membrane in vivo.

2. Prediction of β-Type Membrane Protein Structure

If the β-rich conformation of outer membrane proteins is really the determinant of their localization, the prediction system of protein localization should evaluate the possibility of an input protein being the β type. Fortunately, this appears easier than ordinary secondary structure prediction of globular proteins. Several authors have proposed prediction methods. Here, a method that is conceptually simple and two other recently published methods are briefly described. In the method by Schirmer and Cowan (1993), a kind of hydrophobicity plot like the hydropathy plot of Kyte and Doolittle (1982) is used. The amino acid index representing hydrophobicity is modified to emphasize the effect of aromatic residues. Considering the structural feature of β strands, the averaged value of 4 positions (i − 2, i, i + 2, and i + 4 for position i) is taken, and the plot is drawn for both even- and odd-numbered positions. The peaks correspond well to the observed positions of the β strands. In a recent method by Gromiha et al. (1997), each position is scored according to five aspects of β structure: the preference to be in a membrane β strand (contributions both from the position only and from the average of the neighboring 6 residues are considered); a hydrophobic parameter (again, two kinds of contributions are considered); and the amphiphilicity. Highly scored positions are regarded as the nucleus of the structure formation. Then, the region is extended in both directions until a low-score position appears on both sides. Another method was proposed by Diederichs et al. (1998).
This method is very simple in the sense that it trains a neural network using amino acid sequences as inputs and, as outputs, the z coordinates of Cα atoms in a coordinate frame with the outer membrane in the xy plane. The performances of these methods have not been compared.

3. Sorting of Lipoproteins

Bacterial lipoproteins are anchored to the membrane by their covalently linked lipid moiety. Although they are first anchored to the inner membrane on their synthesis, some of them are then transferred to the outer membrane. Therefore, some sorting machinery must exist. It has been revealed that there is a specific pathway that includes the

LolA periplasmic chaperone and the LolB outer membrane receptor (Matsuyama et al., 1995; Matsuyama et al., 1997). It seems that the sorting signal of this pathway is the residue next to the amino-terminal cysteine residue of the mature part, where a fatty acid is attached. If this second residue is aspartate, the protein will remain on the inner membrane; otherwise, it will be transferred to the outer membrane, although some additional structural context can affect its destiny (Yamaguchi et al., 1988; Gennity et al., 1992).

4. Protein Secretion Pathways

Many bacteria secrete a wide range of proteins including pathogenic factors such as toxins. These proteins must pass through both the inner and outer membranes. There are various mechanisms for protein secretion. Among them, three pathways are conserved in many species of gram-negative bacteria (Salmond and Reeves, 1993; Nunn, 1999).

a. ABC-Mediated Pathway. The type I pathway is also called the ABC (ATP-binding cassette) pathway because its molecular machinery includes the ABC transporter (Binet et al., 1997). Because it is independent of the Sec pathway, the proteins using this pathway do not possess a signal peptide. In spite of the Sec independence, the SecB chaperone is involved in this pathway (Delepelaire and Wandersman, 1998). The transport occurs without creating any periplasmic intermediates. All proteins using this pathway (except the bacteriocins) seem to have a carboxy-terminal secretion signal of about 60 residues, which is specifically recognized by the ABC protein. In addition, most of them have glycine-rich repeats (“GGXGXD”) close to their carboxy terminus. Of these C-terminal signals, the last 15 residues are especially important. Although there is no significant similarity between the C-terminal signals, many of them have a characteristic motif at their C-terminal end.
The motif consists of a negatively charged residue followed by three to five hydrophobic residues, such as “DVID.” A recent nuclear magnetic resonance (NMR) study suggests that the motif must be in a flexible and unstructured state, which helps the ABC transporter to access it (Izadi-Pruneyre et al., 1999).

b. Other Pathways. The type II pathway uses the Sec-dependent, general secretory pathway (GSP). It is probably the major pathway in gram-negative bacteria. The transported proteins have a cleavable signal peptide and, in the first step, are transported to the periplasm, where they fold. The translocation across the outer membrane requires a specific molecular machinery, but the sequence determinants for selection are not well

understood. Type II protein export is related to pilus biogenesis (Nunn, 1999). The type III pathway is used by some pathogens (Mecsas and Strauss, 1996; Alfano and Collmer, 1997). Using this pathway, virulence factors are delivered directly to the cytoplasm of host cells when the bacterium makes contact with the cell. The (uncleaved) signal for this pathway is likely to be located in the N-terminal region, but no common features have been found (Michiels and Cornelis, 1991). In the Yop proteins of Yersinia species, these signals appear to be recognized at the mRNA level, as frameshift mutations do not prevent their secretion (Anderson and Schneewind, 1997). The type III secretion system is related to flagellar biogenesis (Kuwajima et al., 1989; Young et al., 1999).

E. Sorting Specific for Gram-Positive Bacteria

As described in Section II,A, gram-positive bacteria have only one membrane (the cytoplasmic membrane). Therefore, translocation through the Sec pathway directly leads to secretion of the proteins (Simonen and Palva, 1993; Nagarajan, 1993). The issue of protein sorting into the cell wall is described in a separate section.

1. Cell Wall Sorting

In gram-positive bacteria, many surface proteins are covalently anchored to the cell wall. A universal molecular mechanism for this process is conserved in a wide range of species. The target proteins have an N-terminal signal peptide, an “LPXTG” motif, a carboxy-terminal hydrophobic domain, and a charged tail (Schneewind et al., 1992; Schneewind et al., 1993). With the signal peptide and the hydrophobic domain, the target forms a type I topology in the first step. Then, a proteolytic cleavage occurs in the motif: “LPXT|G,” where “|” stands for the cleavage site. Next, its soluble part is linked to the wall peptidoglycan. An extensive review on this topic has been published (Navarre and Schneewind, 1999).

F. Prediction of Localization in Bacterial Cells

1. General Aspects

Staden (1999) classified gene-finding approaches into “gene search by signal” and “gene search by content.” The former approach is to find genes in the way the cellular molecular machinery does, i.e., searching for genes from our knowledge of promoters, terminators, start and stop codons,

etc. On the other hand, in the latter approach, some statistical features that are, for example, the by-products of specifying codons are examined. The ‘‘search by content’’ is generally more powerful than the ‘‘search by signal,’’ perhaps reflecting the lack of total understanding of the signal-recognition processes. Quite similarly, in the field of localization prediction, there are two approaches: ‘‘prediction by signal’’ and ‘‘prediction by content.’’ In the former scheme, prediction is made based on the knowledge of various sorting signals, whereas the second prediction is made based on the statistics such as the deviation of amino acid composition. With the ‘‘prediction by signal’’ approach, the real sorting processes are simulated, more or less. This can be useful to verify the generality of current knowledge. The main drawback is that our knowledge is still incomplete. More than one sorting pathway is directing proteins to a specific site. A protein can even have its own sorting machinery, and there is a hitchhike mechanism in which only one subunit within a protein complex has the sorting signal. Another potential problem of this approach is that it requires full-length precursor sequences as inputs because partial sequences may lack some sorting signal(s). It is especially problematic on the systematic annotation of ORFs found in a genome because their start codons may be incorrect (Nielsen et al., 1999). On the other hand, the ‘‘prediction by content’’ approach is applicable regardless of the variety of sorting pathways. It may be applied to partial sequences, which are now massively produced day by day. In addition, this approach allows a simple and unified treatment, which is convenient for objective testing (e.g., cross validation). However, there is no guarantee that the amino acid composition of proteins in each localization site is well conserved. 
Even when a clear tendency is observed for a known set of proteins, it can be an artifact resulting from the deviation of data because the number of known proteins for each site is often insufficient to perform reliable statistical analyses. It is also evident that this approach cannot handle the differences among isoforms with different localization (see Section III,K,3). Both types of prediction methods exist. Because the methodology is largely the same for prokaryotic and eukaryotic data, only the signal-based methods, which include knowledge specific to prokaryotes, are described here. Other methods are described in the section on eukaryotes.

2. Prediction by Signal Information

In a pioneering attempt, Sjöström et al. (1987) performed a multivariate data analysis on the N-terminal signal peptides of E. coli proteins.

Their analysis showed that the signal peptides are characteristic for each localization site. However, this may not be a general phenomenon because the data set was rather small and because such differences do not seem to be used as sorting signals in E. coli.

In 1991, Nakai and Kanehisa presented a new prediction system (PSORT) for the localization sites in gram-negative bacteria (Nakai and Kanehisa, 1991). This system is based on an artificial intelligence technique, the expert system. Expert systems have a unit known as a knowledge base, which contains a set of knowledge for solving a specific problem, and an inference-engine unit, which exploits the knowledge base and enables the system to solve the problem by "deduction" (see appropriate textbooks on artificial intelligence for more information). Some typical knowledge bases are implemented as a collection of "if-then" type rules, which are called production rules. The PSORT system is equipped with two groups of rules. The first group of rules calls various subprograms and stores the results in the so-called working memory, whereas the second group of rules combines these results to make the final prediction. The subprograms include McGeoch's method and von Heijne's method for signal peptide prediction (McGeoch, 1985; von Heijne, 1986a), the methods of Klein et al. (1985, 1988) for predicting lipoproteins and transmembrane segments, and the observation of Yamaguchi et al. (1988) on lipoprotein sorting. One problem was that the sorting signals that distinguish periplasmic proteins from outer membrane proteins were not known. Therefore, the amino acid compositions of the predicted mature parts were compared between the two groups. The difference was quite impressive, and even the protruding segments of outer membrane proteins can possibly be detected based on this character (Nakai, 1991). PSORT was later expanded for gram-positive bacteria, yeasts, animals, and plants (Nakai and Kanehisa, 1992).
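
The two-group rule organization described above can be illustrated with a toy sketch. The Python below is not PSORT's code: the two feature functions, their thresholds, and the three decision rules are hypothetical stand-ins, chosen only to show how a first group of rules fills a working memory and a second group of if-then rules combines the stored results.

```python
HYDROPHOBIC = set("AILMFVW")


def has_signal_like_nterm(seq):
    """Hypothetical subprogram: at least 8 hydrophobic residues in the first 20."""
    return sum(aa in HYDROPHOBIC for aa in seq[:20]) >= 8


def has_tm_like_segment(seq, start=20, width=17, min_hydrophobic=14):
    """Hypothetical subprogram: a mostly hydrophobic window after the N terminus."""
    for i in range(start, len(seq) - width + 1):
        window = seq[i:i + width]
        if sum(aa in HYDROPHOBIC for aa in window) >= min_hydrophobic:
            return True
    return False


# Group 1: rules that call subprograms and store results in working memory.
FEATURE_RULES = [
    ("signal", has_signal_like_nterm),
    ("tm", has_tm_like_segment),
]

# Group 2: if-then (production) rules over the working memory, tried in order.
# The conditions and site labels are illustrative assumptions only.
DECISION_RULES = [
    (lambda wm: wm["tm"], "inner membrane"),
    (lambda wm: wm["signal"], "periplasm or outer membrane"),
    (lambda wm: True, "cytoplasm"),
]


def predict_site(seq):
    """Fill the working memory, then fire the first decision rule that matches."""
    working_memory = {name: subprogram(seq) for name, subprogram in FEATURE_RULES}
    for condition, site in DECISION_RULES:
        if condition(working_memory):
            return site, working_memory
```

Keeping the feature rules and the decision rules in separate tables mirrors the separation between knowledge base and inference engine: a rule can be added or reordered without touching `predict_site`.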
Although it has become a rather old program, PSORT is still widely used via the Internet. One problem with PSORT is that it uses many numeric parameters that cannot be optimized to a given set of training data (i.e., sequences with known localization sites). Thus, a way to optimize these parameters by some machine-learning technique has been sought. Horton and Nakai (1996) proposed a new probabilistic reasoning model that showed significantly better prediction accuracy on both E. coli and yeast data. Later, a simpler and well-known algorithm, the k-nearest neighbor method, produced even better results (Horton and Nakai, 1997). This was surprising because this algorithm ignores the inherent hierarchy between various signal-recognition events (e.g., signal peptides should be recognized first) and treats all variables equally. Its eukaryotic version (PSORT II) has been released, and its source code is distributed free of charge (Nakai and Horton, 1999). Note that most of the subprograms in PSORT II have not been upgraded from the original PSORT. New discoveries on sorting signals stated in this review should be incorporated.

III. SORTING OF EUKARYOTIC PROTEINS

A. Overview

In spite of the variety of appearances of eukaryotic cells, their intracellular structures are essentially the same. Because of their extensive internal membrane structure, however, the problem of precise protein sorting for eukaryotic cells becomes much more difficult than that for bacteria. Figure 4 schematically illustrates this situation. There are various membrane-bound compartments within the cell. Such compartments are called organelles. Besides the plasma membrane, a typical animal cell has the nucleus, the mitochondrion (which has two membranes; see Fig. 6), the peroxisome, the ER, the Golgi apparatus, the lysosome, and the endosome, among others. As for the Golgi apparatus, there are more precise distinctions between the cis, medial, and trans cisternae, and the TGN (trans Golgi network) (see Fig. 8). In typical plant cells, the chloroplast (which has three membranes; see Fig. 7) and the cell wall are added, and the lysosome is replaced with the vacuole.

FIG. 4. A schematic illustration of the membrane structure of a hypothetical eukaryotic (animal) cell.

Except for a small number of proteins that are coded in the genomes of mitochondria and chloroplasts, all other proteins are synthesized in the cytosol. Further, most of them are thought to have their sorting signal(s) as a part of their own (precursor) amino acid sequences. Such signals are recognized by some molecular machineries, and the proteins are transported to their localization sites. Thus, it is likely that we can also interpret their amino acid sequence and can predict their localization sites using knowledge of such signals. Of course, the actual situation is complex. There are multiple pathways for each localization site. Some of the signals seem to be represented as a specific conformation, and their sequence features are hard to recognize. Knowledge of protein sorting mechanisms and signals is still limited. However, it is undoubtedly a problem worth tackling in this age of genome-sequencing and postsequencing projects. The next section summarizes what is known regarding the protein sorting mechanisms of eukaryotic cells, emphasizing the knowledge on sorting signals. A few published prediction methods for them are also described.

B. Signal Peptides and Membrane Proteins

A protein can be transported to the target site in two ways (see Fig. 8). In one, proteins are transported directly within the cytosol. In the other, they are enclosed in transport vesicles, and the vesicles are finally fused to the target membrane. The latter pathway is called the vesicular pathway (or the secretory pathway, although the vesicular pathway also includes the endocytic pathway). Which route is taken depends on whether or not a protein is recognized by the SRP. As described in the sections on bacteria, SRP recognizes the N-terminal signal peptides and the signal anchors in (integral) membrane proteins.

1. Eukaryote-Specific Aspects of Signal Peptides

Most of the arguments described in the sections on bacterial signal peptides and membrane proteins seem to be valid for the eukaryotic systems, as well as the translocation phenomena across the ER membrane (Sakaguchi, 1997). They also seem to be true for the translocation system across the mitochondrial inner membrane into the intermembrane space and the system across the thylakoid membrane in chloroplasts. Although the TAT-dependent pathway has not been found in the ER, it exists on the thylakoid membrane (and possibly on the inner membrane of mitochondria). The architecture of signal peptides including the (-3, -1) rule also holds in eukaryotes, although there are small differences in preferred residues (von Heijne, 1986b; Nielsen et al., 1997).
In the ER translocation system, most mammalian proteins are likely to use the SRP-dependent pathway, whereas in yeast the SRP-independent pathway as well as the SRP-dependent pathway is heavily used. The SecB-dependent pathway in bacteria seems to correspond to this SRP-independent pathway, which is posttranslational. Instead of SecB, various proteins including BiP, Sec62p, and Sec63p are involved. The specificity for the preceding two pathways is determined by the hydrophobicity of signal peptides (Ng et al., 1996; Zheng and Gierasch, 1996); signal peptides of proteins preferring the SRP-dependent pathway tend to be more hydrophobic. It is noteworthy that the efficiency of signal peptides varies, but it does not correlate with their binding affinity to the SRP complex, owing to the existence of a second signal recognition event at the ER (Belin et al., 1996; Siegel, 1995). In short, a signal peptide is more informative than previously expected.

2. Membrane Proteins

As described in Section II,C,2, some differences exist between the bacterial and eukaryotic systems in the assembly of multispanning membrane proteins (Gafvelin et al., 1997); however, they also have many points in common: the multispanning membrane proteins are likely to be cotranslationally integrated (Ulbrandt et al., 1997), and both systems use homologous translocon channels, which play an important role in the topogenesis of these multispanning membrane proteins (Prinz et al., 1998).

C. Lipid Anchors

Of the extremely diverse examples of protein modifications observed in eukaryotic cells, the modifications by lipid (and glycolipid) molecules are of special interest because lipid-attached proteins can be anchored at the membrane, although not all of these proteins are anchored. So far, three groups of membrane-anchoring proteins have been noted (Fig. 5).

1. Myristoylation and Palmitoylation

Some long-chain fatty acids are covalently linked to proteins by acylation.
Of these, two types are observed rather frequently: myristoylation and palmitoylation (Grand, 1989).

a. Myristoylation. In myristoylation, a 14-carbon saturated fatty acid, myristic acid, is linked to an N-terminal glycine residue (McIlhinney, 1998). This glycine is originally positioned next to the initiator methionine, which is removed during translation. The reaction is catalyzed cotranslationally by myristoyl CoA:protein N-myristoyltransferase. The consensus sequence of this reaction is "(NH2-)MGXXXS," where "NH2-" represents the N terminus. The third "X" is preferably uncharged and the sixth serine can be other small, uncharged residues.

FIG. 5. Schematic representation of lipid anchors. The arrow heads and tails represent the N termini and the C termini of mature proteins, respectively. (a) Palmitoylation, (b) N-myristoylation, (c) Prenylation, (d) GPI anchor

b. Palmitoylation. In palmitoylation, a 16-carbon saturated fatty acid, palmitic acid, is linked to cysteine or serine (or threonine) at apparently any position in a sequence. It is a posttranslational reaction. No consensus sequences around the modification sites have been found, but for the cysteine-type palmitoylation, smaller classes of consensus may exist. For example, the existence of the N-terminal "MGX0–5C" motif seems to show the possibility of both (N-)myristoylation and palmitoylation. Another motif, "ΦLCCX(R/K)(R/K)," was proposed from the analysis of G-coupled receptors (Strittmatter et al., 1990) and another, "IPCCPV," was proposed from the analysis of surfactant-associated proteins (Johansson et al., 1991).

c. Factors for Stable Anchoring. Although most of the palmitoylated proteins are membrane-bound, a significant fraction of myristoylated proteins are cytosolic. Additional factors are probably needed for their stable membrane anchoring. For example, like prenylated proteins, myristoylated proteins often have a region rich in positively charged residues near the modification site. This region seems to stabilize the membrane anchoring through electrostatic interaction with negatively charged phospholipids (McLaughlin and Aderem, 1995). In addition, myristoylation (or prenylation) sites are often found near palmitoylation sites (Resh, 1994). Lastly, both myristoylation and palmitoylation may occur dynamically, and these reactions may be used as regulatory switches for signal transduction systems (Milligan et al., 1995; McLaughlin and Aderem, 1995).

2. Prenylation

Protein prenylation (also called isoprenylation) attaches a 15-carbon farnesyl or a 20-carbon geranylgeranyl group (from the corresponding diphosphate) to a cysteine residue near the C termini of the target proteins (Overmeyer et al., 1998; Rodríguez-Concepción et al., 1999a). This reaction is conserved in both animals and plants. The functions of the target proteins include signal transduction, nuclear architecture, and vesicular transport.

a. Three Reactions. Three enzymes that catalyze this reaction are known: the farnesyltransferase (FTase) and the geranylgeranyltransferases (GGTases) I and II (the latter has been renamed Rab-GGTase because so far it only takes the Rab family as its substrates). FTase and GGTase I are heterodimers and share a common subunit. Both of them recognize a C-terminal motif, the "CaaX" box, where "C" is cysteine, "a" represents an aliphatic amino acid such as isoleucine, and "X" further determines the specificity. That is, if "X" is leucine, the substrate is geranylgeranylated by GGTase I; if "X" is serine, methionine, cysteine, alanine, or glutamine, it is preferentially farnesylated by FTase. This reaction is enhanced by the presence of a basic region, rich in arginines or lysines, near the CaaX box (see above). After prenylation, the C-terminal three residues ("aaX") are cleaved and the new carboxy end of the cysteine is methylated.
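
As a rough illustration, the N-myristoylation consensus and the CaaX box rule described above can be written as regular expressions. This is a sketch under stated assumptions, not a published predictor: the set of "small, uncharged" residues allowed at position 6 (`[STAGC]`) and the set of aliphatic residues for "a" (`[AVLI]`) are choices made here for illustration, and the function names are hypothetical.

```python
import re

# "(NH2-)MGXXXS": Met-Gly at the N terminus, three arbitrary residues, then
# serine or another small uncharged residue (the set STAGC is an assumption).
MYRISTOYLATION_RE = re.compile(r"^MG.{3}[STAGC]")

# "CaaX" box at the C terminus; the aliphatic set AVLI is an assumption.
CAAX_RE = re.compile(r"C[AVLI]{2}(.)$")


def has_myristoylation_consensus(seq):
    """True if the precursor sequence starts with the MGXXXS-like consensus."""
    return MYRISTOYLATION_RE.match(seq) is not None


def classify_caax(seq):
    """Return the modification suggested by the last residue of a CaaX box."""
    match = CAAX_RE.search(seq)
    if match is None:
        return None
    x = match.group(1)
    if x == "L":
        return "geranylgeranylation (GGTase I)"
    if x in "SMCAQ":
        return "farnesylation (FTase)"
    return "CaaX box, enzyme preference unclear"
```

For example, `classify_caax("MKKDECVLS")` reports a farnesylation candidate, whereas a sequence ending in "CKKL" returns `None` because its "a" positions are not aliphatic under the assumed residue set. A match only marks a candidate site; as the text notes, nearby basic regions and other factors also matter.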
Rab-GGTase (GGTase II) has a different substrate specificity; it recognizes double cysteine motifs such as "XXCC," "XCCX," "CCXX," and "XCXC," where "X" represents any residue, in the C termini of the Rab family.

b. Sorting of Prenylated Proteins. All known prenylated proteins seem to be anchored to the membrane, but the membrane can be either (the cytoplasmic face of) the plasma membrane, the nuclear membrane, or some other membrane involved in the vesicular pathway (described in Section III,H). Additional signals that discriminate these membranes are not well understood, but there is a report that a nuclear localization signal combined with the CaaX box of lamin A is likely to determine its localization at the nuclear envelope (Holtz et al., 1989). In addition, the localization of the target may be dynamically regulated by this modification. The prenylated state of a novel plant calmodulin directs it to the plasma membrane; otherwise, it is localized at the nucleus (Rodríguez-Concepción et al., 1999b).

3. Glycosylphosphatidylinositol

The last of the three major classes of membrane anchors results from modification by a glycophospholipid, glycosylphosphatidylinositol (GPI) (Udenfriend and Kodukula, 1995a; Takeda and Kinoshita, 1995). GPI anchors are observed in many eukaryotes, especially in protozoa and yeasts. Unlike other classes, the GPI-anchored proteins are exposed at the (extracytoplasmic) surface of the plasma membrane. Thus, we can predict the localization at the plasma membrane from the presence of a GPI anchor, although some of these proteins are further incorporated into the cell wall in S. cerevisiae (as described in Section III,K,1).

a. Biogenesis of GPI Anchor. The biosynthesis of a GPI anchor is rather complex. The precursors are equipped with the features of type I membrane proteins; that is, they have a cleavable N-terminal signal peptide and one additional membrane-spanning domain near the C terminus. One prominent feature of these precursors is that their cytoplasmic tail is very short. The reaction is posttranslational; at the endoplasmic reticulum, the signal peptide is cleaved and the protein is inserted into the membrane with the type I topology. Then, the C-terminal hydrophobic segment is cleaved proteolytically and replaced with the GPI-anchor precursor. The residue to which the GPI is attached is called the ω site. It is usually located 5 to 10 residues N-terminal to the hydrophobic region, which is about 15 to 20 residues long.

b. Prediction of GPI-Anchored Proteins.
There is a vague consensus around this site, which may be used for the prediction of potential modification sites (Udenfriend and Kodukula, 1995b; Eisenhaber et al., 1998): nearly half the residues at the ω site are serine, and other small residues are allowed; the residues at the position immediately C-terminal to the ω site (the ω + 1 site) are similar to those at the ω site; and most of the residues at the ω + 2 site are serine or alanine for protozoa and alanine or glycine for metazoa. In addition, both the ω + 4 and ω + 5 positions are rich in hydrophobic residues. In another approach, Antony and Miller (1994) compared the amino acid composition between a subsegment of an input sequence and the mature part of the averaged GPI-anchored protein. By decreasing the segment length from the C terminus, they deduced the cleavage point. Nakai and Kanehisa (1992) also reported that a selection criterion of a type I protein with a very short (≤10 residues) tail is sufficient to predict known GPI-anchored proteins. Recently, using in silico approaches, Caro et al. (1997) and Hamada et al. (1998a) screened potential GPI-anchored (cell wall) proteins from the entire open reading frames of S. cerevisiae. Hamada et al. assumed that such proteins should have a cleavable signal peptide, a serine/threonine-rich sequence for glycosylation, and a C-terminal GPI-attachment signal, composed of an attachment site and a hydrophobic stretch. The GPI-anchored proteins are related to various cellular functions. For example, they participate in protein sorting to the apical surface of polarized cells and in clathrin-independent endocytosis (see Section III,J,4).

D. Nucleocytoplasmic Transport

Recently, there has been an explosion in our understanding of the nucleocytoplasmic transport systems that import proteins into the nucleus and export them from it. The emerging picture is more complex than expected, perhaps too complicated for this knowledge to be applied directly to the interpretation of localization signals at this time. This section briefly describes the total picture and then summarizes the various signals involved. As there are so many references on this subject, most of the original references are not cited.

1. Importins, Exportins, and Nuclear Pore Complex

The nucleus is surrounded by the nuclear envelope, which takes on a lumenal structure connected to the endoplasmic reticulum. The transport of proteins into (and out of) the nucleus occurs through the nuclear pore complex (NPC), a large complex composed of more than 100 different proteins (Talcott and Moore, 1999).
Because the NPC forms an aqueous pore across the two membranes, small proteins less than 9 nm in diameter can pass through it simply by diffusion. However, most transport of both proteins and RNAs is mediated by an active transport mechanism. It is now clear that there is heavy traffic through the NPC in both directions. Proteins are not only imported into the nucleus but also actively exported from it. There are many reasons for nuclear export. One reason is to send some shuttle proteins back after their import; another is for some viral proteins to export their replicated genomes outside the nucleus.
The basic mechanism of this transport is as follows (Mattaj and Englmeier, 1998; Ohno et al., 1998; Weis, 1998; Görlich, 1998; Wozniak et al., 1998; Smith and Raikhel, 1999). The target proteins (cargo) have either some import signal(s) or some export signal(s). The import signals are recognized by a family of proteins called importins, and the export signals are recognized by exportins. Sometimes, the recognition event itself is not made by these receptors but is mediated by some adaptor proteins. The complex of the cargo, the receptor, and sometimes the adaptor is transported to the other side of the nuclear envelope. The final release of the bound cargo is regulated by the function of Ran, a member of the small GTPase family. Typical examples of an adaptor and a receptor are importin α (also called karyopherin α) and importin β (also called karyopherin β), respectively. Importin α recognizes a classic type of nuclear localization signal (see Section III,D,2) and binds to the cargo protein. In mammalian cells, there are three subfamilies of such adaptors: Rch1, NPI-1, and Qip1. It seems possible that different adaptors have different specificities in vivo, but this problem is still under investigation. Importin α binds to importin β, which has the ability to shuttle between the nucleus and the cytosol. There are also a number of importin β-like proteins; 14 examples of these were found in the genome of S. cerevisiae. All of them may be players in the transport, but most of them are orphan receptors (i.e., their ligands are not known). Not only proteins but also RNAs can become cargo. Moreover, a receptor can bind simultaneously to different kinds of adaptors or cargo. The three-dimensional structures of two receptors (including importin β) were recently determined (Mattaj and Conti, 1999).

2. Nuclear Import Signals

As described previously, many kinds of adaptors and receptors are likely to have a variety of specificities for various cargo.
However, most of these specificities have not been clarified, and there are too few known cargo for each receptor to generalize for predictive analyses.

a. NLSs. Examples of known transport signals are listed in Table II. The best-known signal is called the NLS (nuclear localization signal), and many examples are already known (Garcia-Bustos et al., 1991; Hicks and Raikhel, 1995). Note that the term NLS is often used to represent the import signal in general. There are two types of NLSs: the (simple) NLS and the bipartite NLS. The simple NLS was first found in the SV40 large T antigen. It is a stretch of polypeptide, rich in basic (positively charged) residues and sometimes proline residue(s). Prolines may be found at the boundary of NLSs. It is not cleaved after the translocation, and its position in the amino acid sequence appears to be unconstrained. Because of this positional independence, simple NLSs are not easy to detect specifically. Nakai and Kanehisa (1992) used empirical rules for detecting them in PSORT/PSORT II, but their general specificity is unknown. Another classical NLS, the bipartite NLS, was first found in Xenopus nucleoplasmin. It comprises two interdependent basic regions separated by a spacer of about 10 residues. Unlike for the simple NLS, the rule to detect this signal, first proposed by the original authors, seems quite effective (Robbins et al., 1991). These NLSs are recognized by importin α and are transported by importin β.

TABLE II
Examples of Nucleocytoplasmic Transport Signals^a

Signal | Features | Receptor/adaptor
Simple basic NLS | SV40 type; a stretch rich in basic residues and often in proline residues | Importin β/importin α family
Bipartite basic NLS | Nucleoplasmin type; two clusters containing basic residues, separated by spacer of about 10 aa | Importin β/importin α family
M9 domain | Found in hnRNP A1; Gly-rich 38 aa stretch; also works as export signal | Transportin/none
(STAT 1) | Uncharacterized; requires phosphorylation of Tyr and dimerization | Importin β/NPI-1
KNS | Residues 323-361 of hnRNP K; also works as export signal | Unknown/unknown
(U snRNPs) | Sm proteins and trimethylguanosine cap structure of RNA | Importin β/Snurportin
(ribosomal proteins) | Basic residue-rich but not NLS | Kap123p or Pse1p/none
NES | Found in HIV-1 Rev; leucine-rich | Exportin 1/none
(importin α) | Unknown | CAS/none
RRE (RNA signal) | Rev responsive element; 234 nt segment of env in HIV-1 | Exportin 1/Rev

^a The last three signals are used for nuclear export.

b. Other Signals. Several other types of signals are known, but there are only a few known examples of each type. One such example is the M9 domain, which is a segment of 38 residues rich in glycines and aromatic residues. It is observed in hnRNP A1 protein and its relatives,
and is directly recognized by transportin, an importin β-like receptor. There is also a signal of quite different character: the NRS (nuclear retention sequence). Proteins bearing this signal are not exported from the nucleus even when they have export signals. The hnRNP C protein has such an NRS, which is a segment of 78 residues.

Note that the nuclear localization events are often regulated. Typical examples are found in transcription factors that respond to outside signals. There are a variety of mechanisms for such regulation (Vandromme et al., 1996). For example, a localization signal may usually be hindered by another protein, a complete signal may be created on assembly with another protein, a factor may usually be linked to the membrane, or a phosphorylation may create a new signal. The interpretation of such regulation from the amino acid sequences alone would be rather difficult.

3. Nuclear Export Signals

The signals that direct the passenger proteins out of the nucleus are called nuclear export signals (NES). Again, the term NES may be used to represent either one classic type or all types of export signals in general. A classic type of NES was first found in the Rev protein of HIV-1. Later, some other members were added. It is a leucine-rich signal, and the consensus sequence is "LX2–3ΦX2–3LX(L/I)," where Φ is a hydrophobic residue. Other examples of export signals are also known. Some import signals can also work as export signals. In addition, some RNA sequences, or some RNA structures such as the monomethyl guanosine cap structure, can also be recognized as export signals.
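
The leucine-rich NES consensus quoted above translates almost directly into a regular expression. The sketch below is only an illustration of that consensus: the residue set used for the hydrophobic position Φ (`[LIVFM]`) is an assumption made here, and the function name is hypothetical.

```python
import re

# "L X2-3 Φ X2-3 L X (L/I)"; the hydrophobic set for Φ (LIVFM) is an assumption.
# A lookahead with a capture group lets overlapping candidates all be reported.
NES_RE = re.compile(r"(?=(L.{2,3}[LIVFM].{2,3}L.[LI]))")


def find_nes_candidates(seq):
    """Return (start, matched_segment) for every position matching the consensus."""
    return [(m.start(), m.group(1)) for m in NES_RE.finditer(seq)]
```

Because the pattern is short and degenerate, hits of this kind should be treated as candidates rather than predictions.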

E. Mitochondrial Targeting Signals

Most mitochondrial proteins are nuclear encoded and thus must be targeted to mitochondria and sorted into one of their subcompartments after their synthesis in the cytosol. Because mitochondria have two membranes, there are four localization sites: the matrix, the inner membrane, the intermembrane space, and the outer membrane (Fig. 6). Although there has been considerable progress in our understanding of these processes, some questions still remain. Moreover, the total picture is rather complicated and contains many exceptions. A simplified view is presented here based mainly on the view of Pfanner and Mihara (Mihara and Omura, 1996; Pfanner et al., 1997; Pfanner, 1998). There are also a number of other excellent reviews on this subject (Schatz, 1996; Stuart and Neupert, 1996; Neupert, 1997; Roise, 1997).
FIG. 6. Internal structure of mitochondria.

1. General Pathway Through Tom–Tim Complexes

Many mitochondrial proteins are first synthesized as preproteins, which have an extension (presequence) at their N termini. The extension is used as the targeting signal to mitochondria. Various cytosolic factors may recognize this signal, but of these factors, MSF (mitochondrial import stimulation factor) seems to play an important role. It binds to the presequences and directs them to the Tom70–Tom37 receptor on the outer membrane. "Tom" stands for "translocases of the outer membrane"; there is also the "Tim" family on the inner membrane. The preprotein is then transported to the general import pore of the outer membrane, which includes Tom40, Tom22, Tom20, and Tom5. There are also many preproteins that are not bound to MSF and are directly transported into the import pore with the aid of a general chaperone, Hsp70. It is known that the Tom20, Tom22, and Tom5 proteins, which contain many acidic residues, can interact electrostatically with the presequence, which has several positively charged residues. In addition, the Tim23 protein, a part of the Tim23–Tim17 inner membrane channel, has more negative charges and can attract preproteins even more strongly. This "acid-chain" effect appears to help the efficient translocation of preproteins across the outer membrane. Then, the preprotein translocates across the inner membrane and its presequence is cleaved by the matrix processing peptidase (MPP). Thus, a preprotein is transported to the matrix by default. Therefore, the presequence is also called the matrix-targeting signal.

2. Variations of Theme

Some variations of the preceding pathway are used for further sorting.

a. Outer Membrane Proteins. First, most outer membrane proteins do not have presequences but rather internal targeting signals. Many of
them are targeted to the Tom40 channel as usual. Then they are inserted into the channel, but the insertion is stopped by the "stop-transfer" mechanism. Finally, they are released into the membrane laterally, with the aid of the Tom7 protein.

b. Intermembrane Space Proteins 1. Another variation is seen for the intermembrane space proteins without presequences, such as cytochrome heme lyases. After the translocation across the outer membrane, they remain in the intermembrane space, although the reason is unclear. For a single-spanning protein, the mean hydrophobicity of the spanning region seems important because increasing the hydrophobicity of an outer membrane protein causes its localization at the inner membrane (Steenaart and Shore, 1997).

c. Intermembrane Space Proteins 2 and Inner Membrane Proteins 1. Third, there are examples of other intermembrane space proteins and some inner membrane proteins with presequences. They use both the Tom and Tim channels, and their presequence is cleaved at the matrix side of the inner membrane. The targeting signal of this type of intermembrane space protein has a bipartite structure. After the cleavage of the presequence, the second signal, which is similar to the bacterial signal peptides, emerges. The subsequent step is controversial: either they are reexported into the intermembrane space from the matrix, or their translocation is stopped by a "stop-transfer" mechanism similar to that of the inner membrane proteins. For intermembrane-space proteins, the second signal is also cleaved proteolytically.

d. Inner Membrane Proteins 2. Most of the inner-membrane proteins, which are often multispanning membrane proteins, do not have presequences. Typical examples are the large family of carrier proteins, which use a branch of the general pathway.
That is, after the usual translocation across the outer membrane with the aid of Tim10 and Tim12, they are transported to the Tim22–Tim54 complex, where they are integrated into the inner membrane. Although this is an interesting problem, the topogenic mechanism of membrane proteins is not discussed here (but see later) (Stuart and Neupert, 1996).

Clearly, there are many exceptions to this view. Theoretically, it is interesting to see how much a unifying model can explain the localization of the total set of mitochondrial proteins.

3. Sorting Signals and Their Processing

The sequence features of presequences are well known; typically they are from 20 to 80 residues long; preferably contain basic residues, serine,
and alanine; but have no or few acidic residues. They are likely to form an amphiphilic α-helix conformation. Their domain structure has been proposed, but it is not as clear as that for signal peptides (von Heijne et al., 1989).

a. Mitochondrial Processing Peptidase. The sequence pattern around the cleavage site of presequences is not as clear as the (-3, -1) rule either. Even biochemical analysis of purified MPP fails to find a clear substrate specificity, although the importance of the number of nearby basic residues with suitable distances was suggested (Song et al., 1996). It is considered that MPP recognizes a three-dimensional motif (Luciano and Géli, 1996). In addition, the cleavage site does not mark the end point of signal information. A set of consensus patterns based on the position of the amino-terminal arginine residue has been proposed (Gavel and von Heijne, 1990). The cleavage-site consensus was recently reexamined using a sophisticated neural network model, and three motifs were found, where arginine positions were at -10, -3, or -2. Of these, the R-10 motif is a cleavage site by MIP (see later) (Schneider et al., 1998). A predictor based on this finding was constructed.

b. Mitochondrial Intermediate Peptidase. There is also a peptidase, mitochondrial intermediate peptidase (MIP), which processes proteins after the removal of their presequences by MPP. A consensus pattern around its cleavage site is "RX | (F/L/I)X2(T/S/G)X4 |," where the second "|" represents the cleavage site by MIP (Branda and Isaya, 1995).

c. Internal Signal. Compared with the presequences, the nature of internal targeting signals has not been well characterized. Because MSF recognizes both presequences and internal signals, they seem to share some common features, such as the richness of basic residues. The topogenic signals of membrane proteins are still under investigation.
The positive-inside rule seems to hold for mitochondrially encoded proteins and for some but not all imported proteins (G. von Heijne, personal communication, 1999).

4. Prediction of Targeting Peptides

Prediction of mitochondrial targeting signals is not an easy task. The proposed amphiphilic nature is not clear enough. Nakai and Kanehisa (1992) developed a simple method based on the amino acid composition of the 20 amino-terminal residues. In addition, a simple rule to discriminate the bipartite signal of intermembrane-space proteins was also included in PSORT.
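The composition-based idea can be illustrated with a minimal sketch. The function below tallies, over the N-terminal 20 residues, the residue classes that the text associates with presequences (basic residues, serine, alanine, and acidic residues). It is not the discriminant function of Nakai and Kanehisa (1992); the function name, the grouping of residues, and the example fragment are assumptions made purely for illustration.

```python
from collections import Counter

BASIC = set("RK")    # assumption: Arg and Lys are counted as basic (His excluded)
ACIDIC = set("DE")   # Asp and Glu


def n_terminal_features(sequence, window=20):
    """Return residue-class fractions for the first `window` residues.

    Illustrative feature extraction only; no decision threshold is implied.
    """
    segment = sequence[:window].upper()
    if not segment:
        raise ValueError("empty sequence")
    counts = Counter(segment)
    n = len(segment)
    return {
        "basic": sum(counts[a] for a in BASIC) / n,
        "acidic": sum(counts[a] for a in ACIDIC) / n,
        "serine": counts["S"] / n,
        "alanine": counts["A"] / n,
    }


# Illustrative presequence-like fragment (an assumption, not data from the text).
example = "MLSLRQSIRFFKPATRTLCSSRYLL"
features = n_terminal_features(example)
print(features)
```

A real predictor would feed such fractions (or the full 20-letter composition) into a trained discriminant function; the weights of that function are not given in the text and are therefore not sketched here.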

Claros (1995) released an attractive program, MitoProt. In this program, various sequence features of a potential signal region are reported to assist in the user’s decision making. Later, an objective prediction method that combines many sequence features by discriminant analysis was proposed (Claros and Vincens, 1996). With a cross-validation test, its accuracy was estimated to be 75%. Fujiwara et al. (1997) proposed an HMM that can detect mitochondrial targeting signals. The HMM was automatically created to best explain the training data. Although it could model the signals in the training data, further analysis using more data is desirable because the model has many numeric parameters.

F. Peroxisomal Targeting Signals

Peroxisomes are ubiquitous organelles enclosed by a single membrane. They contain various enzymes that take part in cellular functions including β-oxidation of fatty acids and hydrogen peroxide metabolism. All of these proteins are transported from the cytosol after their synthesis. A large set of proteins, called peroxins, are engaged in the biogenesis of peroxisomes. Some of them are related to the protein import apparatus (McNew and Goodman, 1996; Subramani, 1998; Olsen, 1998; Kunau, 1998; Crookes and Olsen, 1999).

1. Matrix Proteins

a. PTSs. The matrix proteins of peroxisomes are first synthesized at free ribosomes and are posttranslationally incorporated according to their targeting signals. Two kinds of peroxisomal targeting signals (PTSs) for matrix proteins are known: PTS1 and PTS2. PTS1 is a carboxy-terminal motif, “SKL” or its conservative substitutions. Passenger proteins bearing PTS1 are recognized by the Pex5p protein (in yeast) and are targeted to the Pex13p receptor, protruding from the peroxisomal membrane. PTS2 is a segment of 9 residues usually existing within the N-terminal 20 to 30 residues. Its consensus pattern is “(R/K)(L/V/I)X5(H/Q)(L/A).” So far, far fewer examples of PTS2 than of PTS1 have been found.
In plants and mammals PTS2 is cleaved in the matrix, whereas in yeast it is not. It seems that the cleavage is unrelated to the translocation. Its receptor is Pex7p (in yeast) and the complex of Pex7p and the passenger is targeted to its membrane receptor, Pex14p. Interestingly, Pex14p can also interact with Pex5p. The subsequent import mechanism has not been clarified, but it has a remarkable character; it can pass proteins in a folded or even oligomerized state.
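The two matrix-protein signals lend themselves to simple pattern matching. The following is a minimal sketch, assuming that PTS1 is tested only as the exact C-terminal tripeptide "SKL" by default (which conservative substitutions to accept is left to the caller) and that PTS2 must lie entirely within the first 30 residues. The function names and the window default are assumptions made for illustration, not part of any published tool.

```python
import re

# PTS2 consensus from the text: (R/K)(L/V/I)X5(H/Q)(L/A).
# The lookahead lets overlapping candidates be reported as well.
PTS2 = re.compile(r"(?=[RK][LVI].{5}[HQ][LA])")


def has_pts1(sequence, tripeptides=("SKL",)):
    """True if the sequence ends with one of the given C-terminal tripeptides."""
    return sequence.upper().endswith(tuple(tripeptides))


def find_pts2(sequence, n_terminal_window=30):
    """Return 0-based start positions of PTS2-like nonapeptides that lie
    entirely within the first `n_terminal_window` residues."""
    region = sequence[:n_terminal_window].upper()
    return [m.start() for m in PTS2.finditer(region)]


# Illustrative sequences (assumptions, not data from the text).
print(has_pts1("MAAAASKL"))               # C-terminal SKL present
print(find_pts2("MHRLQVVLGHLAGRSESS"))    # one PTS2-like match near the N terminus
```

A match is only a candidate: the patterns are short, so hits in non-peroxisomal proteins are expected, and a PTS1 hit presupposes that the C terminus is accessible.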

b. Predictive Work. Recently, Geraghty et al. (1999) performed a systematic search of potential peroxisomal proteins coded in the yeast genome. They searched for the patterns “(S/A/C)(K/H/R)L,” “(S/A)(Q/N)L,” and “SKF” at the C terminus of proteins longer than 99 amino acids. They also searched for the pattern “RLX5HL” within the first 25 residues of proteins longer than 99 amino acids. They not only could detect most of the known proteins, but also could find an additional 18 candidates. Experimentally they confirmed that at least 10 of them are peroxisomal.

2. Membrane Proteins

Owing to the small number of known peroxisomal membrane proteins, the signals (mPTSs) that direct them to the membrane are not well characterized. In one example, the mPTS is found in the fourth loop (20 residues) of a protein with six membrane-spanning segments. The loop faces the matrix side. In another example, the mPTS was found in the N-terminal 40-residue segment on the matrix side. Both contain a 5-residue stretch rich in basic amino acids, which seems to be a part of the mPTS. Surprisingly, it turned out that some peroxisomal membrane proteins are synthesized at the ER, cotranslationally. This seems a rare example that breaks the independence between the free ribosome system and the membrane-bound ribosome system (another example is found in a sorting mechanism into the vacuole). This phenomenon may indicate that the peroxisome originated evolutionarily from the endoplasmic reticulum.

G. Chloroplast Transit Peptides

Chloroplasts are a typical type of plastid that performs various metabolic reactions as well as photosynthesis. Their envelope consists of two membranes: the outer envelope membrane and the inner envelope membrane (Fig. 7). The space between these two membranes is called the intermembrane space, and the space enclosed by the inner envelope membrane is called the stroma.
In addition, chloroplasts have another membrane system within the stroma; the thylakoid membrane forms the lumen. Therefore, there are six different localization sites and, of course, multiple pathways to each site. Naturally, their sorting mechanisms are very complicated.

FIG. 7. Internal structure of chloroplasts.

1. Features of Stromal Targeting Signal

Most chloroplastic proteins, except those of the outer envelope membrane, have an N-terminal extension that is usually cleaved during maturation. It is used as a chloroplast transit peptide, directing the passenger to the chloroplasts. Like the mitochondrial targeting signal, it directs the passenger to traverse two envelope membranes by default. Therefore, it is also called the stromal targeting signal. Its length can vary from 30 to more than 100 residues. It is rich in serine and threonine but deficient in acidic residues. It is not clear whether it has the three-domain structure like the signal peptide (von Heijne et al., 1989). Recently, a chloroplast-processing enzyme was identified as the general stromal processing peptidase (Richter and Lamppa, 1998). Although the chloroplast transit peptide has been studied relatively thoroughly, further intrachloroplastic sorting signals have not been well understood. Some of them are introduced below in the context of describing the known sorting pathways (Keegstra and Cline, 1999; Chen and Schnell, 1999; Cline and Henry, 1996; Kouranov and Schnell, 1996).

2. Pathways to Envelope and Stroma

Like the Tom and Tim systems on mitochondrial outer and inner membranes, chloroplasts use the Toc and Tic systems on their outer and inner envelope membranes. Although there may not be a direct correspondence between both subunits, their functions for protein translocation appear quite similar. Thus, most of the sorting mechanisms within the envelope membranes are recognized as variations of the general sorting pathway to the stroma. Part of the sorting pathway to the outer envelope membrane (such as Toc75) is mediated by a bipartite signal that consists of the chloroplast

transit peptide and the C-terminal hydrophobic region. It is likely that the second part works as a stop-transfer signal at the outer envelope membrane. However, there is another class of outer membrane proteins (such as Toc34) that do not have a cleavable transit peptide. These proteins appear to be relatively small. A study suggested that their signal resides in the N-terminal 30 residues (Li and Chen, 1996). The signal consists of a positively charged N-terminal portion followed by a hydrophobic core, although it is not certain whether this feature is general. The sorting process to the intermembrane space is not known. The inner envelope membrane proteins have a cleavable N-terminal transit peptide, as well as some hydrophobic domain(s) in their mature portion. There are two possibilities for the role of this hydrophobic domain: it may work as an N-terminal signal peptide after the translocation into the stroma and the subsequent cleavage of the transit peptide. Alternatively, it may work as a stop-transfer signal. One more important question is how the distinction is made between the outer membrane proteins, the inner membrane proteins, and the thylakoid membrane proteins. It is still an enigma.

3. Pathways through Thylakoid Membrane

All thylakoidal proteins seem to be first translocated into the stroma through the previously mentioned general import pathway; all of them have a cleavable N-terminal transit peptide. However, there are at least four different pathways into the thylakoid membrane (Robinson et al., 1998; Schnell, 1998). Most of them are reminiscent of the pathways of bacteria, described in Section II,B,1. This is not surprising because chloroplasts most likely evolved from a prokaryotic endosymbiont, but there are certain differences. The first pathway is the Sec pathway. A SecA homolog (cpSecA) was purified. The process is then ATP-dependent. The second pathway is the SRP-like pathway.
Like the bacterial counterpart, it is used by hydrophobic membrane proteins and is GTP-dependent. However, unlike the bacterial system, the reaction is posttranslational, although it may also be used for the cotranslational transport of plastid-encoded membrane proteins. The third pathway is the ΔpH pathway (Settles and Martienssen, 1998). As described earlier, it is similar to the TAT-dependent pathway of bacteria. The proteins using this pathway have a second signal with a characteristic twin-arginine motif. The fourth pathway is used by some integral membrane proteins. They seem to be inserted spontaneously into the membrane without requiring any energy or proteinaceous machinery.

Although not yet established, the dependency on the preceding pathways is likely to be determined by the second signal part. As with the bacterial system, the distinction between the Sec system and the SRP-like system is made from the hydrophobicity of the second signal.

4. Prediction of Transit Peptide

Nakai and Kanehisa (1992) used a simple discriminant function based on the amino acid composition of the N-terminal 20 residues for detecting chloroplast transit peptides in their PSORT system. The second hydrophobic signal was also detected, and it was regarded as evidence of targeting to the thylakoid membrane. A weight matrix reflecting the cleavage site of the second signal made by Howe and Wallace (1990) was also included. Recently, Emanuelsson et al. (1999) constructed a prediction system of chloroplast transit peptides, named ChloroP. The technique used is very similar to the one used to construct the predictor of signal peptides, SignalP. It uses artificial neural networks and can also predict the cleavage site of transit peptides. Notably, they claim that the annotation of cleavage sites in the public database can often be contaminated by subsequent proteolytic reactions and that their system could detect a stronger preference of sequence pattern around the cleavage site, which seems quite probable.

H. Sorting via Transport Vesicles

As stated previously, there are various mechanisms by which cargo proteins are transported through the cytosol, usually accompanying a receptor molecule. From this section on, examples of the other strategy for cargo transport are described: cargo is encapsulated within a small vesicle and the vesicle is transported to the target membrane (Rothman and Wieland, 1996). Understanding of the molecular mechanisms of this pathway has greatly increased in recent years. Some important notions, such as the bulk flow, have been reconsidered.
Although some sorting signals related to this pathway have been found, they are still insufficient to explain the majority of the data. In addition, the pathways are complicated and some of them are still speculative. Moreover, the transportation system is so dynamic that the static notion of “localization site” sometimes seems inappropriate. Therefore, selected (maybe oversimplified) knowledge on the sorting mechanisms is presented to illustrate the known signals.

1. Transport Machinery

a. Retrograde versus Anterograde. Vesicles containing cargo (whether soluble or membrane-bound) are formed by a budding process (Schekman, 1996; Schmid, 1997; Le Borgne and Hoflack, 1998b). After the transportation to a target membrane, the vesicles are fused with the membrane and the cargo proteins are released. A soluble resident protein may be engulfed within the vesicles accidentally. Thus, there are also pathways for retrieving mistargeted proteins and recycling the vesicles used. Such a pathway is called the retrograde pathway, whereas the usual one is called the anterograde pathway. So a protein residing at a localization site can have a signal either for preventing its entering new transport vesicles (called a retention signal) or for being selected by vesicles in the retrograde pathway (a packaging/transport signal). Both examples are known. So far, the latter kind is the one usually observed.

b. COPs and SNAREs. In many cases, the budding process is initiated by the oligomerization of coat proteins, which sculpt the vesicle. Three kinds of coat proteins are well characterized: clathrin, COP I, and COP II; but there are also other coat proteins. Cargo proteins having a packaging signal are condensed around the budding site by the interaction with the coat proteins or some associated proteins. After budding, the coat proteins are dissociated from the vesicle. Then the vesicle is targeted to the destination membrane by using a specific pair of SNARE proteins; that is, the vesicle has a specific SNARE protein, a v-SNARE (vesicle-SNARE), and the target membrane also has another kind of SNARE, a t-SNARE. Several pairs of SNAREs correspond to the transport pathways. The two SNAREs are specifically associated through the complex formation of the two-stranded coiled-coil structure. This binding and the function of associating proteins such as NSF and SNAPs induce the fusion of the vesicle with the target membrane.
Although SNAREs are important to ensure the correct transport of vesicles, the signal recognition process at the bud formation is more important for the prediction of localization. c. Selective Export. It was once hypothesized that there is a bulk flow in the secretory pathway. According to the hypothesis, a protein without any signals, except a signal peptide, is transported to the plasma membrane by default. However, it is now likely that even secreted proteins could have some signals for their rapid export to the cell surface, although their nature has not yet been clarified. It is possible that a modification, N-linked glycan, works as a signal for the quality control for protein folding in the ER (Fiedler and Simons, 1995). In addition, there is a report that a diacidic signal (‘‘DXE’’) on the cytoplasmic tail works as a signal for selective export from the ER (Nishimura and Balch, 1997).
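The diacidic export signal is a three-residue pattern, so a scan for it is short. The sketch below assumes that the caller already knows which stretch of the protein is the cytoplasmic tail (the text does not describe how to delimit it), and the function name is hypothetical.

```python
import re

# 'DXE': Asp, any residue, Glu. A lookahead reports overlapping hits too.
DIACIDIC = re.compile(r"(?=D.E)")


def find_diacidic(tail):
    """Return 0-based positions of 'DXE' within a cytoplasmic-tail sequence."""
    return [m.start() for m in DIACIDIC.finditer(tail.upper())]


# Illustrative tail fragment (an assumption, not data from the text).
print(find_diacidic("KLYTDIEMNRLGK"))
```

Because the pattern is so short, chance occurrences are frequent; a hit is meaningful only in a tail that is actually exposed to the cytosol.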

I. Endoplasmic Reticulum, Golgi Apparatus, and Secretory Pathway

In a simplified view, the total flow is as follows (Fig. 8). Both soluble and membrane proteins that are translated at the membrane-bound ribosome are first localized at the ER. Some of them are transported to the Golgi apparatus, whereas others remain at the ER. At the Golgi apparatus, including the trans Golgi network (TGN), the next selection occurs; some are transported to the plasma membrane, others to the endosome and finally to the lysosome/vacuole, and still others remain there. The lysosome is also an important organelle for the other transport system, the endocytic pathway. In this pathway, proteins at the plasma membrane are internalized by endocytosis. The sorting to lysosomes is treated in the next section.

1. Sorting to ER

The mechanisms of ER protein sorting are relatively well known (Teasdale and Jackson, 1996). Most, if not all, soluble ER proteins have a well-conserved C-terminal tetrapeptide, “KDEL” (“HDEL” for yeasts) in addition to a cleavable N-terminal signal peptide. A small set of membrane proteins also has this signal at the C terminus. This signal

FIG. 8. Secretory pathway and endocytic pathway: a simplified view.

turned out to be a transport/packaging signal in the retrograde pathway; it is recognized at the Golgi apparatus by the Erd2p protein, the so-called KDEL receptor. Then, the receptor–ligand complex is transported back to the ER.

A set of ER membrane proteins with the N_exo/C_cyt topology have a “KKXX” or a “KXKXX” at their C terminus. Note that the C termini of the membrane proteins of this topology are located at the cytosolic side. The protruding segment is called the cytoplasmic tail. The motif is called the dilysine motif (or the “KKXX” motif). Although the flanking residues can affect the retention efficiency, the existence of these two lysines is essential. The spacing between the membrane and the motif can also have an effect. For multispanning proteins, the signal seems to work if its C terminus is positioned at the cytosol.

For type II membrane proteins, in which the N-terminal segment is exposed at the cytosolic side, the signal was found on their N terminus. There are relatively few type II proteins in the ER. This time, a diarginine motif, “XXRR,” is important. Its variations include “XRR,” “XXXRR,” “XRXR,” and “XXRXR.” The replacement of “R” with “K” may be allowed in some cases.

Many ER proteins have neither the KDEL signal nor the dilysine signal. Many of them seem to be retrieved from the Golgi apparatus by a mechanism dependent on a Golgi membrane protein, Rer1p (Sato et al., 1996; Sato et al., 1997). The signal of this Rer1p-dependent retrieval has not been fully characterized, but recent studies suggest the importance of a very hydrophobic segment of about six residues and its flanking less hydrophobic regions within transmembrane segments of cargo proteins (A. Nakano, personal communication, 1999).

The vesicles used in the retrograde transport from the Golgi to the ER are coated by the COP I protein, whereas in the anterograde transport from the ER to the Golgi, the COP II protein seems to be a principal player.
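The C-terminal retrieval motifs of this section can be checked with a few lines of code. The sketch below covers only the KDEL/HDEL tetrapeptide and the dilysine motif; the N-terminal diarginine motif is left out because its exact register relative to the initiator methionine is not specified here. The function name and the `yeast` flag are assumptions made for illustration, and the caller is assumed to know that a dilysine candidate belongs to a membrane protein whose C terminus is cytosolic.

```python
import re

# 'KKXX' or 'KXKXX' anchored at the C terminus.
DILYSINE = re.compile(r"(KK..|K.K..)$")


def er_retrieval_signals(sequence, yeast=False):
    """List the C-terminal ER retrieval motifs found in `sequence`.

    `yeast=True` tests for 'HDEL' instead of 'KDEL' (hypothetical switch).
    Topology is not checked: a dilysine hit is relevant only when the
    C terminus faces the cytosol.
    """
    seq = sequence.upper()
    tetrapeptide = "HDEL" if yeast else "KDEL"
    found = []
    if seq.endswith(tetrapeptide):
        found.append(tetrapeptide)
    if DILYSINE.search(seq):
        found.append("dilysine")
    return found


# Illustrative sequences (assumptions, not data from the text).
print(er_retrieval_signals("MKAAAAKDEL"))
print(er_retrieval_signals("MAAAAKKTN"))
```

Keeping the two tests separate mirrors the text: KDEL acts on lumenal (soluble) proteins through Erd2p, whereas the dilysine motif acts on cytoplasmic tails.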
The dilysine motif turned out to interact with COP I, directing the retrieval to the ER. This is plausible because COP I exists on the cytosolic surface of the Golgi membrane.

2. Sorting to Golgi Apparatus

The Golgi apparatus consists of several discrete cisternae and transient proteins are transported through them in the defined direction of cis to trans. During this passage, glycoproteins are processed with a series of ordered modification reactions. It is believed that each modification enzyme, glycosyltransferase, exists at some defined subsets of the cisternae in the order of reaction. Therefore, some precise sorting mechanisms must exist. After transport through these cisternae, they, as well as other

proteins, are sorted to several target sites. Of course, there must be a distinction mechanism between the transients and the residents (Munro, 1998). The precise nature of the localization mechanisms at the Golgi apparatus has not yet been found (Allan and Balch, 1999). So far, no soluble residents have been found; all Golgi residents are either integral membrane proteins or lipid-anchored peripheral membrane proteins. All known glycosyltransferases are type II single-spanning membrane proteins. Their retention signal has been extensively sought (Colley, 1997). The signal seems to exist at their transmembrane domain (or at the flanking regions of the domain). Interestingly, the transmembrane domains of mammalian Golgi enzymes are about five residues shorter than those of plasma membrane proteins and are rich in bulky residues such as phenylalanine. Two hypotheses have been proposed to explain this. One is the oligomerization or “kin-recognition” model, in which the proteins having this type of transmembrane domain assemble with each other and finally form a large protein complex, which is difficult to transport further. The other is the lipid-sorting or “bilayer-thickness” model, which assumes the presence of patches (rafts) of lipids with different thicknesses. In fact, lipid rich in sterol/sphingolipid is thicker than lipid rich in phospholipids. Proteins with a shorter spanning region tend toward the thinner lipid regions and are thus segregated. Several experiments support the bilayer-thickness model, but other possibilities remain. There are other types of sorting mechanisms. For example, some proteases (such as furin) of the TGN have the signal for TGN localization at their cytoplasmic tail. They contain tyrosine-containing motifs similar to the endocytic signal (see Section III,J,1).
Recently, a novel Golgi-targeting domain was found in several peripheral membrane proteins having a coiled-coil structure (Munro and Nichols, 1999; Barr, 1999; Kjer-Nielsen et al., 1999). This domain, called the GRIP domain, is about 50 residues long and resides at the C terminus.

J. Lysosome/Vacuole and Endocytic Pathway

1. Clathrin-Coated Vesicles and Adaptor Protein Complexes

In endocytosis, vesicles are formed at the plasma membrane and then transported to an endosome. (More precisely, endosomes should at least be classified into early endosomes and late endosomes, but this fact is ignored here.) The endocytic pathway also includes the following routes: from the endosome to the lysosome, from the endosome to the plasma

membrane, from the endosome to the Golgi apparatus, and from the Golgi to the endosome (Fig. 8) (Mellman, 1996; Marsh and McMahon, 1999). The distinction between these routes corresponds to the differential uses of transport vesicles (Le Borgne and Hoflack, 1998a); that is, the clathrin-coated vesicles with the adaptor, AP-1, are used in the pathway from the Golgi (TGN) to the endosome; the vesicles with clathrin and AP-2 are used in the pathway from the plasma membrane to the endosome (endocytosis itself); and the ones with clathrin and AP-3 seem to be used from the endosome to the lysosome (although there is a dispute on whether or not AP-3 binds to clathrin). These adaptors of clathrin coats are composed of four different subunits. More important, the motifs discovered in the cytoplasmic tails as endocytic signals can probably be explained by their (independent) interactions with these subunits (Kirchhausen et al., 1997). For example, there are tyrosine-based motifs, represented as “YXXΦ” or “NPXY,” where “Φ” is a bulky hydrophobic residue. These motifs interact with the μ subunits of AP-1, AP-2, and AP-3. A dileucine-based signal, which is represented as “LL” or “Lφ,” where “φ” is a small hydrophobic residue, was also noted. This motif is shown to interact with the β1 subunit of AP-1, at least. One important remaining problem is the mechanism enabling the distinction between these pathways. It seems possible that the affinities to such a signal vary for these three kinds of adaptor complexes. Neighboring sequence patterns can probably affect this. For example, neighboring acidic clusters appear to have some influence on the accessibility of the above signals to APs.
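As a rough illustration of how the adaptor-binding motifs above might be located in a cytoplasmic tail, the sketch below scans for the tyrosine-based form (Y, any two residues, then a bulky hydrophobic residue) and for the "LL" dileucine form. The set of residues accepted at the bulky hydrophobic position is an assumption, since the text does not enumerate it, and the function name is hypothetical. The other variants mentioned in the text are not covered.

```python
import re

BULKY_HYDROPHOBIC = "LIMFV"  # assumption: residues accepted at the last position

# Lookaheads with a capture group report overlapping hits and the matched text.
TYROSINE_MOTIF = re.compile(r"(?=(Y..[" + BULKY_HYDROPHOBIC + r"]))")
DILEUCINE_MOTIF = re.compile(r"(?=(LL))")


def scan_tail(tail):
    """Return (motif name, 0-based position, matched residues) tuples,
    sorted by position, for a cytoplasmic-tail sequence."""
    seq = tail.upper()
    hits = [("YXXPhi", m.start(), m.group(1)) for m in TYROSINE_MOTIF.finditer(seq)]
    hits += [("LL", m.start(), m.group(1)) for m in DILEUCINE_MOTIF.finditer(seq)]
    return sorted(hits, key=lambda hit: hit[1])


# Illustrative tail fragment (an assumption, not data from the text).
print(scan_tail("RKRSHAGYQTI"))
```

Such a scan says nothing about which adaptor complex is engaged; as the text notes, that distinction probably depends on affinities and on neighboring sequence, which this sketch ignores.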
Recently another signal, the phenylalanine-tryptophan motif (or the motif containing two aromatic residues, such as “(F/Y)X(F/Y),” in yeast), was shown to direct some proteins including cation-dependent mannose 6-phosphate receptor (see Section III,J,2) from endosomes to the TGN (Schweizer et al., 1997; Burd et al., 1998). Most probably, it is used for the (retrograde) recycling processes. A putative coat complex was also discovered and named “retromer” (Seaman et al., 1998). Another motif, “NPFXD,” was also described as a new endocytosis signal in yeast (Tan et al., 1996). More signals will be clarified in future analyses.

2. Lysosome-Targeting Signals

The lysosome is an acidic organelle containing many hydrolases and can degrade most biological macromolecules (Kornfeld and Mellman, 1989). The vacuole in yeast and plants is thought to be its functional relative. Both of them are bounded by a single membrane and have

various import pathways for proteins, which may become their residents or may be degraded. Most of these pathways of lysosomes and vacuoles seem to have their counterpart in each other.

a. Man-6-P Signal. One major but unique pathway of lysosomal proteins is the mannose 6-phosphate-dependent pathway for soluble proteins. In this pathway, the passenger proteins are modified to have a special form of N-glycosylation bearing mannose 6-phosphate (Man-6-P). Man-6-P is recognized as a lysosomal targeting signal by two forms of mannose 6-phosphate receptors (Ludwig et al., 1995; Le Borgne and Hoflack, 1998b). The problem then is how lysosomal proteins are specifically modified to have Man-6-P. There are no clearly conserved motifs around the modification sites except the ubiquitous N-glycosylation signal, “NX(S/T).” From the X-ray structural analysis of cathepsin D, it has been proposed that a certain specific conformation, a β-hairpin structural motif near the modification site, is important for recognition by the glycosyltransferase (Metcalf and Fusek, 1993). Recent X-ray analysis of another protein, β-glucuronidase, has verified this hypothesis (Jain et al., 1996). A weak sequence similarity containing a variable spacer region has also been noted.

b. Other Signals. Targeting signals for lysosomal membrane proteins seem to be located at the cytoplasmic tail. Two motifs, the “GY” motif and the “LI” motif, have been described. In a recent experiment, the “(D/E)EX3L(I/L/V)” motif was shown to selectively bind to AP-3 (Höning et al., 1998; Burd et al., 1998) (see Section III,J,3). Another motif, the diacidic-based motif, was found, which binds to the β-subunit of the COP I coatomer (Piguet et al., 1999). A nonubiquitous organelle, the melanosome, is specialized for melanin synthesis. It is somewhat similar to the lysosome, and the resident proteins are derived from the endocytic pathway.
A sorting signal, “NQPLLT,” was found in the cytoplasmic tail of a human membrane protein (Vijayasaradhi et al., 1995).

3. Sorting Pathways into Vacuole

A variety of protein import pathways into the vacuole are known (Burd et al., 1998; Bryant and Stevens, 1998). They include the sorting from the Golgi apparatus, endocytosis, autophagy (where a part of the cytoplasm such as a mitochondrion is engulfed into a newly formed vacuole and is degraded), direct import from the cytosol, and the vacuolar inheritance from the mother cell. Of these, the pathways from the Golgi

apparatus and the direct import are important for the study of protein sorting. There are at least two pathways from the Golgi (TGN) to the vacuole: the CPY pathway and the ALP pathway (Conibear and Stevens, 1998). They were named for their typical passengers, carboxypeptidase Y and alkaline phosphatase, respectively.

a. CPY Pathway. In the CPY pathway, proteins are first transported to the prevacuolar endosome and then to the vacuole. Proteins show the preproprotein structure in their sequences. Namely, the presequence, the cleavable signal peptide, is first cleaved at the ER, and then the pro sequence is (often) cleaved at the vacuole. This pro sequence contains the sorting signal, which is recognized by the receptor molecule, Vps10p (in yeast). Mutational analyses indicate the importance of the four-residue motif, “QRPL.”

b. ALP Pathway. Unlike the CPY pathway, the ALP pathway does not use the prevacuolar endosome as an intermediate target. As in yeast, AP-3 is required in mammalian cells for sorting some lysosomal enzymes. As described in Section III,J,2, a dileucine-related motif, “(D/E)EX3L(I/L/V),” in the cytoplasmic tail is used as a signal of this pathway.

c. Cvt Pathway. The direct import pathway from the cytosol is called the Cvt (cytoplasm-to-vacuole targeting) pathway. It is unique because it does not use the early secretory pathway. Two proteins are known to use this pathway: aminopeptidase I (API) and α-mannosidase I (Ams1p). Neither has a classic signal peptide, but both have a propeptide at their N termini. This propeptide is required for the targeting. Unlike signal peptides, it is not hydrophobic and shows an amphiphilic nature. The process of targeting seems to be similar to that of autophagy.

4. Polarized Membrane Sorting

Many cells have an asymmetric structure because of the necessity for function (Drubin and Nelson, 1996).
For example, (the outer surface of ) the plasma membrane of epithelial cells is fenced by a tight junction so that the lipids are separated between the apical part and the basolateral part (Fig. 9) (Eaton and Simons, 1995). Therefore, some molecular mechanisms must exist to sort the plasma membrane proteins into these two parts. Some signals related to the secretory/endocytic pathways have been found important (Matter and Mellman, 1994). Their details are not described here because the area is too specific for predictive purposes.

FIG. 9. Polarized epithelial cells.

Sorting to the basolateral membrane is mediated by signals in the cytoplasmic tail such as the tyrosine-dependent motif, whereas sorting to the apical membrane is not well characterized, although the glycolipid of GPI-anchored proteins can work as an apical sorting signal. A possible role of N-glycans as an apical sorting signal has been suggested (Rodriguez-Boulan and Gonzalez, 1999). Other kinds of signals have also been reported.

K. Miscellaneous Issues

1. Sorting of Yeast Cell Wall Proteins

The cell wall of S. cerevisiae is composed of glucan, mannoproteins, and chitin (Klis, 1994; Cid et al., 1995). Some of the mannoproteins are first synthesized as GPI-anchored and mannosylated proteins. Subsequently, they are incorporated from the plasma membrane to the cell wall and are covalently linked to the glucan there. Therefore, some signals should exist for dictating the cell wall incorporation. One signal, found in a short region N-terminal to the GPI-attached asparagine (the ω site), is important (Hamada et al., 1998b). Namely, a plasma membrane GPI-anchored protein was localized to the cell wall if the “(V/I)(V/I)X(V/Y)XN” motif (where “N” is the ω site) was created by mutagenesis. Caro et al. (1997) used a rule that plasma-membrane GPI-proteins possess a dibasic residue motif just before their predicted GPI-attachment site to distinguish them.

2. Cytoplasmic Retention Signals

Because proteins are synthesized within the cytosol (including on the surface of the ER), it is plausible to postulate that the localization at the cytosol is a default. Such a notion is challenged by the discoveries of several cytoplasmic targeting/retention signals. In one example, the N-terminal 42-residue segment of cyclin B directs it to the cytosol; if the segment is deleted, cyclin B is transported to the nucleus; and the addition of the segment to cyclin A causes its cytosolic retention (Pines and Hunter, 1994). Another example is found as an 18-residue segment within the matrix protein of Mason–Pfizer monkey virus (Choi et al., 1999). Although it is beyond the scope of this review, the sorting of cytoskeletal and cytoskeleton-associated proteins is also a related and interesting issue (Vallee and Sheetz, 1996).

3. Differential Localization of Isoforms

Some proteins have isoforms, i.e., counterparts with similar function and structure. They may be encoded by different genes, probably created by gene duplications, or may be encoded by the same gene but have different mRNA structures, caused by alternative splicing events. Other possible mechanisms are known. All isoforms have similar but not identical amino acid sequences. In many cases these isoforms are sorted to different compartments of the cell (Gunning et al., 1998; Danpure, 1995; Mauro and Dixon, 1994). Such examples are of great interest because they provide great opportunities to test our understanding of sorting signals. Prediction methods based on the amino acid composition will perform poorly in such cases. In some typical cases, the extension of a longer form constitutes a localization signal, directing differential localization.
There are also some examples of differential cytosolic localization between isoforms, such as actins and myosins (Gunning et al., 1998). In related cases, a protein can simultaneously localize at different sites owing to the presence of competing or inefficient signals (Danpure, 1995).

L. Prediction of Localization in Eukaryotic Cells

The last topic of this review is systems that predict the localization site of an input amino acid sequence. As stated previously, there are

several programs to detect a certain signal, but it is difficult to predict the general localization from knowledge of sorting signals alone, because the known signals are not general enough to cover the resident proteins in each organelle. Therefore, most systems are based on differences in amino acid composition.

1. Methods Based on Amino Acid Composition

In a series of works, Nishikawa and colleagues have shown that intracellular and extracellular proteins can be distinguished by their amino acid composition. In their recent work, they also used residue-pair frequencies (not only for neighboring residues, but also for pairs separated by spacers) and reported a considerable improvement in prediction accuracy (Nakashima and Nishikawa, 1994). The reason for the correlation between localization and amino acid composition was sought by Andrade et al. (1998). They examined the amino acid composition of proteins with known localization and three-dimensional structure in three ways: total composition, surface composition, and interior composition. Principal component analysis showed that the surface composition correlates best with localization. They therefore concluded that the correlation is the result of evolutionary adaptation of proteins to their surrounding environment.

More systematic predictions have been attempted by three groups. First, Cedano et al. (1997) performed a standard discriminant analysis among five classes, each containing 200 examples: integral membrane proteins, anchored membrane proteins, extracellular proteins, intracellular proteins, and nuclear proteins. The discrimination was rather clear, except for the distinction between anchored proteins and extracellular/intracellular proteins. Second, Reinhardt and Hubbard (1998) performed a prediction using neural networks.
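The composition-based features discussed above can be illustrated with a short sketch. It computes the 20-residue composition of a sequence and the frequencies of residue pairs separated by a chosen number of spacer residues, and then assigns a sequence to the class whose mean composition is closest. This is only an illustration under stated assumptions: the function names, the squared Euclidean distance, and the nearest-mean rule are hypothetical simplifications, not the procedures used by Nakashima and Nishikawa (1994) or Cedano et al. (1997).

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq):
    """Fraction of each standard residue in seq (non-standard letters ignored)."""
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def pair_frequencies(seq, gap=0):
    """Frequencies of ordered residue pairs separated by `gap` spacer residues.

    gap=0 counts neighbouring residues; gap=1 skips one residue, and so on.
    """
    seq = seq.upper()
    step = gap + 1
    counts = Counter()
    for i in range(len(seq) - step):
        a, b = seq[i], seq[i + step]
        if a in AMINO_ACIDS and b in AMINO_ACIDS:
            counts[a + b] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence too short for this gap")
    return {a + b: counts[a + b] / total
            for a, b in product(AMINO_ACIDS, repeat=2)}


def mean_composition(seqs):
    """Average composition over a set of example sequences of one class."""
    comps = [composition(s) for s in seqs]
    return {aa: sum(c[aa] for c in comps) / len(comps) for aa in AMINO_ACIDS}


def nearest_class(seq, class_means):
    """Assumed stand-in for a real classifier: pick the closest class mean."""
    comp = composition(seq)

    def squared_distance(mean):
        return sum((comp[aa] - mean[aa]) ** 2 for aa in AMINO_ACIDS)

    return min(class_means, key=lambda name: squared_distance(class_means[name]))


if __name__ == "__main__":
    # Toy, made-up sequences; the class labels carry no biological meaning.
    means = {
        "class_a": mean_composition(["DDEE", "DEDE"]),
        "class_b": mean_composition(["KKRR", "KRKR"]),
    }
    print(nearest_class("DDDK", means))
```

A real predictor would be trained on many proteins of known localization and would use a proper statistical model; the nearest-mean rule is used here only because it keeps the link between composition and class assignment easy to follow.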
From statistical considerations, Reinhardt and Hubbard selected three locations for prokaryotes (cytoplasmic, extracellular, and periplasmic) and four locations for eukaryotes, excluding plants (cytoplasmic, extracellular, mitochondrial, and nuclear). They excluded membrane proteins because these can be distinguished rather reliably using existing methods. One potential problem of their analysis is that they excluded only sequence pairs with more than 90% identity, so closely related sequences may remain in the data set. Nevertheless, the distinctions between pairs of groups were rather clear. The high accuracy of the distinction between nuclear and cytoplasmic proteins was especially impressive. Third, Chou and Elrod (1999) reported rather comprehensive analyses. They used up to 12 groups of localization sites: chloroplast,

cytoplasm, cytoskeleton, ER, extracellular, Golgi apparatus, lysosome, mitochondria, nucleus, peroxisome, plasma membrane, and vacuole. They proposed a covariant discriminant algorithm for their analysis, which can be regarded as an extension of the usual discriminant analysis. From the results of three kinds of objective tests, they concluded that their method is significantly more accurate than existing methods. This seems plausible because their method is theoretically clear and does not use as many numeric parameters as neural network methods.

2. Hybrid Approach

PSORT (and PSORT II) is the only existing program that uses both knowledge of sorting signals and information on amino acid composition (see Section II,F,2) (Nakai and Kanehisa, 1991; Nakai and Kanehisa, 1992; Nakai and Horton, 1999). A eukaryotic version of PSORT was created by combining various predictors, most of which have already been described (Nakai and Kanehisa, 1992). Note that PSORT calculates amino acid compositions of partial segments and uses only some significant variables. For example, amino acid composition is calculated from the predicted mature portion to discriminate lysosomal proteins; in another example, mitochondrial proteins are discriminated using data from the N-terminal 20-residue segment. In addition to predicting localization sites, PSORT/PSORT II produces useful diagnostic messages about the presence or absence of various known signals, although the knowledge of sorting signals incorporated into PSORT/PSORT II is somewhat outdated. The entire source code of PSORT II is freely distributed upon request, in the hope that it will contribute to progress on this unique prediction problem. PSORT still needs improvement both in the optimization technique used to combine the obtained values and in the subprograms that detect various features.

IV. CONCLUDING REMARKS

In 1988, von Heijne wrote a comprehensive review on protein sorting signals (von Heijne, 1988).
In 1991, I also summarized the knowledge on these signals available at that time (Nakai, 1991). Progress in cell biology on this subject since then has been remarkable: for most previously known signals, the receptor molecules and associated factors have been identified. In addition, a number of novel pathways have been discovered. The postulate that signal peptides serve a single function across many sequences turned out not to be true. Mitochondrial targeting signals, among others, are recognized by various molecules with different affinities. There are a number of orphan receptors for nuclear proteins. Even the apparently

fundamental principle of the secretory pathway, bulk flow, is being challenged by novel experimental results. One signal can be used for both the secretory pathway and the endocytic pathway. Some peroxisomal membrane proteins are exceptionally transported through the secretory pathway, whereas some vacuolar proteins are transported directly without the aid of vesicles. Despite these new discoveries, this knowledge is not likely to raise our prediction rate drastically. The knowledge is becoming more and more precise, yet it still cannot cover many examples. The description in textbooks on protein sorting is not as general as expected, but progress is being made. The need for a quantitative understanding of cellular processes is pressing, for example, when it is applied to the design of proteins. Ongoing systematic experiments determining the subcellular localization of large sets of gene products will further stimulate the need for theoretical approaches (Burns et al., 1994). A static scheme of “subcellular localization site” is not appropriate for describing the dynamic flow of proteins in, for example, the secretory/endocytic pathway. Rather, we should proceed to “simulate” the flow at the molecular level in the future. The challenge of recognizing other kinds of signals, such as modification signals, degradation signals, and even transcriptional regulatory signals, should also be taken up.

Has a common principle been found in protein sorting? Schatz and Dobberstein (1996) discussed some similarities in import/export systems among various organelles. It is most interesting to ask why apparently vague sequence patterns can be so specific. One key seems to lie in the fact that we have treated several different phenomena as one (as in the case of signal peptides). The next 10 years will also be exciting for sequence analysts.
ACKNOWLEDGMENTS

I would like to thank the editor of this volume, Peer Bork, for his invitation to write this chapter and for his patience while waiting for me to finish writing. I am also grateful to Gunnar von Heijne, Koreaki Ito, Akihiko Nakano, and Paul Horton for helpful comments on the manuscript before publication; to Takashi Yamanaka, Hiromitsu Araki, and Kentaro Tomii for helping to collect the references; and to Tetsushi Yada for tips on the LaTeX and tgif programs. This work was supported by a Grant-in-Aid (Genome Science) for Scientific Research from the Ministry of Education, Science, Sports, and Culture of Japan.

REFERENCES

Aizawa, S. I. (1996). Flagellar assembly in Salmonella typhimurium. Mol. Microbiol. 19, 1–5.
Alfano, J., and Collmer, A. (1997). The type III (Hrp) secretion pathway of plant pathogenic bacteria: trafficking harpins, Avr proteins, and death. J. Bacteriol. 179, 5655–5662.

Allan, B., and Balch, W. (1999). Protein sorting by directed maturation of Golgi compartments. Science 285, 63–66.
Anderson, D., and Schneewind, O. (1997). A mRNA signal for the type III secretion of Yop proteins by Yersinia enterocolitica. Science 278, 1140–1143.
Andersson, H., Bakker, E., and von Heijne, G. (1992). Different positively charged amino acids have similar effects on the topology of a polytopic transmembrane protein in Escherichia coli. J. Biol. Chem. 267, 1491–1495.
Andrade, M., O’Donoghue, S., and Rost, B. (1998). Adaptation of protein surfaces to subcellular location. J. Mol. Biol. 276, 517–525.
Antony, A. C., and Miller, M. E. (1994). Statistical prediction of the locus of endoproteolytic cleavage of the nascent polypeptide in glycosylphosphatidylinositol-anchored proteins. Biochem. J. 298, 9–16.
Arrigo, P., Giuliano, F., Scalia, F., Rapallo, A., and Damiani, G. (1991). Identification of a new motif on nucleic acid sequence data using Kohonen’s self-organizing map. Comput. Appl. Biosci. 7, 353–357.
Baldi, P., and Brunak, S. (1998). “Bioinformatics: The Machine Learning Approach.” MIT Press, Cambridge, Massachusetts.
Barr, F. (1999). A novel rab6-interacting domain defines a family of Golgi-targeted coiled-coil proteins. Curr. Biol. 9, 381–384.
Belin, D., Bost, S., Vassali, J., and Strub, K. (1996). A two-step recognition of signal sequences determines the translocation efficiency of proteins. EMBO J. 15, 468–478.
Beltzer, J., Fiedler, K., Fuhrer, C., Geffen, I., Handschin, C., Wessels, H., and Spiess, M. (1991). Charged residues are major determinants of the transmembrane orientation of a signal-anchor sequence. J. Biol. Chem. 266, 973–978.
Berks, B. (1996). A common export pathway for proteins binding complex redox cofactors? Mol. Microbiol. 22, 393–404.
Bernstein, H. (1998). Membrane protein biogenesis: the exception explains the rules. Proc. Natl. Acad. Sci. U.S.A. 95, 14587–14589.
Bibi, E. (1998).
The role of the ribosome-translocon complex in translation and assembly of polytopic membrane proteins. Trends Biochem. Sci. 23, 51–55.
Binet, R., Letoffe, S., Ghigo, J., Delepelaire, P., and Wandersman, C. (1997). Protein secretion by gram-negative bacterial ABC exporters - a review. Gene 192, 7–11.
Bogsch, E., Brink, S., and Robinson, C. (1997). Pathway specificity for a ΔpH-dependent precursor thylakoid lumen protein is governed by a ‘sec-avoidance’ motif in the transfer peptide and a ‘sec-incompatible’ mature protein. EMBO J. 16, 3851–3859.
Bogsch, E., Sargent, F., Stanley, N., Berks, B., Robinson, C., and Palmer, T. (1998). An essential component of a novel bacterial protein export system with homologues in plastids and mitochondria. J. Biol. Chem. 273, 18003–18008.
Boyd, D., and Beckwith, J. (1990). The role of charged amino acids in the localization of secreted and membrane proteins. Cell 62, 1031–1033.
Branda, S., and Isaya, G. (1995). Prediction and identification of new natural substrates of the yeast mitochondrial intermediate peptidase. J. Biol. Chem. 270, 27366–27373.
Briggs, M., Gierasch, L., Zlotnick, A., Lear, J., and DeGrado, W. (1985). In vivo function and membrane binding properties are correlated for Escherichia coli LamB signal peptides. Science 228, 1096–1099.
Brink, S., Bogsch, E., Edwards, W., Hynds, P., and Robinson, C. (1998). Targeting of thylakoid proteins by the ΔpH-driven twin-arginine translocation pathway requires a specific signal in the hydrophobic domain in conjunction with the two-arginine motif. FEBS Lett. 434, 425–430.

Bruss, V., Lu, X., Thomssen, R., and Gerlich, W. (1994). Post-translational alterations in transmembrane topology of the hepatitis B virus large envelope protein. EMBO J. 13, 2273–2279.
Bryant, N., and Stevens, T. (1998). Vacuole biogenesis in Saccharomyces cerevisiae: protein transport pathway to the yeast vacuole. Microbiol. Mol. Biol. Rev. 62, 230–247.
Burd, C., Babst, M., and Emr, S. (1998). Novel pathways, membrane coats and PI kinase regulation in yeast lysosomal trafficking. Semin. Cell Dev. Biol. 9, 527–533.
Burns, N., Grimwade, B., Ross-Macdonald, P., Choi, E.-Y., Finberg, K., Roeder, G., and Snyder, M. (1994). Large-scale analysis of gene expression, protein localization, and gene disruption in Saccharomyces cerevisiae. Genes Dev. 8, 1087–1105.
Caro, L., Tettelin, H., Vossen, J., Ram, A., van den Ende, H., and Klis, F. (1997). In silico identification of glycosyl-phosphatidylinositol-anchored plasma-membrane and cell wall proteins of Saccharomyces cerevisiae. Yeast 13, 1477–1489.
Cedano, J., Aloy, P., Perez-Pons, J. A., and Querol, E. (1997). Relation between amino acid composition and cellular location of proteins. J. Mol. Biol. 266, 594–600.
Chaddock, A., Mant, A., Karnauchov, I., Brink, S., Herrmann, R., Klösgen, R., and Robinson, C. (1995). A new type of signal peptide: central role of a twin-arginine motif in transfer signals for the ΔpH-dependent thylakoidal protein translocase. EMBO J. 14, 2715–2722.
Chen, R., and Henning, U. (1996). A periplasmic protein (Skp) of Escherichia coli selectively binds a class of outer membrane proteins. Mol. Microbiol. 19, 1287–1294.
Chen, X., and Schnell, D. (1999). Protein import into chloroplasts. Trends Cell Biol. 9, 222–227.
Choi, G., Park, S., Choi, B., Hong, S., Lee, J., Hunter, E., and Rhee, S. (1999). Identification of a cytoplasmic targeting/retention signal in a retroviral gag polyprotein. J. Virol. 73, 5431–5437.
Chou, K.-C., and Elrod, D. (1999). Protein subcellular location prediction. Protein Eng.
12, 107–118.
Chuck, S., and Lingappa, V. (1992). Pause transfer: a topogenic sequence in apolipoprotein B mediates stopping and restarting of translocation. Cell 68, 9–21.
Chuck, S., and Lingappa, V. (1993). Analysis of a pause transfer sequence from apolipoprotein B. J. Biol. Chem. 268, 22794–22801.
Cid, V., Duran, A., del Rey, F., Snyder, M., Nombela, C., and Sanchez, M. (1995). Molecular basis of cell integrity and morphogenesis in Saccharomyces cerevisiae. Microbiol. Rev. 59, 345–386.
Claros, M. (1995). MitoProt, a Macintosh application for studying mitochondrial proteins. Comput. Appl. Biosci. 11, 441–447.
Claros, M., Brunak, S., and von Heijne, G. (1997). Prediction of N-terminal protein sorting signals. Curr. Opin. Struct. Biol. 7, 394–398.
Claros, M., and Vincens, P. (1996). Computational method to predict mitochondrially imported proteins and their targeting sequences. Eur. J. Biochem. 241, 779–786.
Cline, K., and Henry, R. (1996). Import and routing of nucleus-encoded chloroplast proteins. Annu. Rev. Cell Dev. Biol. 12, 1–26.
Colley, K. (1997). Golgi localization of glycosyltransferases: more questions than answers. Glycobiology 7, 1–13.
Conibear, E., and Stevens, T. (1998). Multiple sorting pathways between the late Golgi and the vacuole in yeast. Biochim. Biophys. Acta 1404, 211–230.
Cowan, S., and Rosenbusch, J. (1994). Folding pattern diversity of integral membrane proteins. Science 264, 914–916.

Cristóbal, S., de Gier, J.-W., Nielsen, H., and von Heijne, G. (1999). Competition between sec- and tat-dependent protein translocation in Escherichia coli. EMBO J. 18, 2982–2990.
Crookes, W., and Olsen, L. (1999). Peroxin puzzles and folded freight: peroxisomal protein import in review. Naturwissenschaften 86, 51–61.
Cserzö, M., Wallin, E., Simon, I., von Heijne, G., and Elofsson, A. (1997). Prediction of transmembrane α-helices in prokaryotic membrane proteins: the dense alignment surface method. Protein Eng. 10, 673–676.
Dalbey, R., Kuhn, A., and von Heijne, G. (1995). Directionality in protein translocation across membranes: the N-tail phenomenon. Trends Cell Biol. 5, 380–383.
Dalbey, R., Lively, M., Bron, S., and van Dijl, J. (1997). The chemistry and enzymology of the type I signal peptidases. Protein Sci. 6, 1129–1138.
Danese, P., and Silhavy, T. (1998). Targeting and assembly of periplasmic and outer-membrane proteins in Escherichia coli. Annu. Rev. Genet. 32, 59–94.
Danpure, C. (1995). How can the products of a single gene be localized to more than one intracellular compartment? Trends Cell Biol. 5, 230–238.
de Cock, H., Schäfer, U., Potgeter, M., Demel, R., Müller, M., and Tommassen, J. (1999). Affinity of the periplasmic chaperone Skp of Escherichia coli for phospholipids, lipopolysaccharides and non-native outer membrane proteins: role of Skp in the biogenesis of outer membrane proteins. Eur. J. Biochem. 259, 96–103.
de Cock, H., Struyvé, M., Kleerebezem, M., van der Krift, T., and Tommassen, J. (1997). Role of the carboxy-terminal phenylalanine in the biogenesis of outer membrane protein PhoE of Escherichia coli K-12. J. Mol. Biol. 269, 473–478.
de Gier, J.-W., Scotti, P., Sääf, A., Valent, Q., Kuhn, A., Luirink, J., and von Heijne, G. (1998). Differential use of the signal recognition particle translocase targeting pathway for inner membrane protein assembly in Escherichia coli. Proc. Natl. Acad. Sci. U.S.A. 95, 14646–14651.
de Gier, J.-W., Valent, Q., von Heijne, G., and Luirink, J. (1997). The E. coli SRP: preferences of a targeting factor. FEBS Lett. 408, 1–4.
Delepelaire, P., and Wandersman, C. (1998). The SecB chaperone is involved in the secretion of the Serratia marcescens HasA protein through an ABC transporter. EMBO J. 17, 936–944.
Diederichs, K., Freigang, J., Umhau, S., Zeth, K., and Breed, J. (1998). Prediction by a neural network of outer membrane β-strand protein topology. Protein Sci. 7, 2413–2420.
Drubin, D., and Nelson, W. (1996). Origins of cell polarity. Cell 84, 335–344.
Dunlop, J., Jones, P., and Finbow, M. (1995). Membrane insertion and assembly of ductin: a polytopic channel with dual orientations. EMBO J. 14, 3609–3616.
Duong, F., Eichler, J., Price, A., Leonard, M., and Wickner, W. (1997). Biogenesis of the gram-negative bacterial envelope. Cell 91, 567–573.
Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. (1998). “Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids.” Cambridge University Press, Cambridge, Massachusetts.
Eaton, S., and Simons, K. (1995). Apical, basal, and lateral cues for epithelial polarization. Cell 82, 5–8.
Eisenhaber, B., Bork, P., and Eisenhaber, F. (1998). Sequence properties of GPI-anchored proteins near the ω-site: constraints for the polypeptide binding site of the putative transamidase. Protein Eng. 11, 1155–1161.
Eisenhaber, F., and Bork, P. (1998). Wanted: subcellular localization of proteins based on sequence. Trends Cell Biol. 8, 169–170.

Emanuelsson, O., Nielsen, H., and von Heijne, G. (1999). ChloroP, a neural network-based method for predicting chloroplast transit peptides and their cleavage sites. Protein Sci. 8, 978–984.
Eppens, E., Nouwen, N., and Tommassen, J. (1997). Folding of a bacterial outer membrane protein during passage through the periplasm. EMBO J. 16, 4295–4301.
Fekkes, P., and Driessen, A. (1999). Protein targeting to the bacterial cytoplasmic membrane. Microbiol. Mol. Biol. Rev. 63, 161–173.
Fiedler, K., and Simons, K. (1995). The role of N-glycans in the secretory pathway. Cell 81, 309–312.
Folz, R., and Gordon, J. (1987). Computer-assisted predictions of signal peptidase processing sites. Biochem. Biophys. Res. Comm. 146, 870–877.
Fraser, C., Gocayne, J., White, O., Adams, M., Clayton, R., Fleischmann, R., Bult, C., Kerlavage, A., Sutton, G., Kelly, J., Fritchman, J., Weidman, J., Small, K., Sandusky, M., Fuhrmann, J., Nguyen, D., Utterback, T., Saudek, D., Phillips, C., Merrick, J., Tomb, J.-F., Dougherty, B., Bott, K., Hu, P.-C., and Lucier, T. (1995). The minimal gene complement of Mycoplasma genitalium. Science 270, 397–403.
Fujiwara, Y., Asogawa, M., and Nakai, K. (1997). Prediction of mitochondrial targeting signals using hidden Markov models. In Miyano, S., and Takagi, T. (eds.) “Genome Informatics 1997” Universal Academy Press, Inc., Tokyo, Japan, 53–60.
Gafvelin, G., Sakaguchi, M., Andersson, H., and von Heijne, G. (1997). Topological rules for membrane protein assembly in eukaryotic cells. J. Biol. Chem. 272, 6119–6127.
Gafvelin, G., and von Heijne, G. (1994). Topological “frustration” in multispanning E. coli inner membrane proteins. Cell 77, 401–412.
Garcia-Bustos, J., Heitman, J., and Hall, M. (1991). Nuclear protein localization. Biochim. Biophys. Acta 1071, 83–101.
Gavel, Y., and von Heijne, G. (1990). Cleavage-site motifs in mitochondrial targeting peptides. Protein Eng. 4, 33–37.
Gennity, J., Kim, H., and Inouye, M. (1992).
Structural determinants in addition to the amino-terminal sorting sequence influence membrane localization of Escherichia coli lipoproteins. J. Bacteriol. 174, 2095–2101.
Geraghty, M., Bassett, D., Morrell, J., Gatto, Jr, G., Bai, J., Geisbrecht, B., Hietter, P., and Gould, S. (1999). Detecting patterns of protein distribution and gene expression in silico. Proc. Natl. Acad. Sci. U.S.A. 96, 2037–2042.
Görlich, D. (1998). Transport into and out of the cell nucleus. EMBO J. 17, 2721–2727.
Grand, R. (1989). Acylation of viral and eukaryotic proteins. Biochem. J. 258, 625–638.
Gromiha, M., Majumdar, R., and Ponnuswamy, P. (1997). Identification of membrane spanning β strands in bacterial porins. Protein Eng. 10, 497–500.
Gunning, P., Weinberger, R., Jeffrey, P., and Hardeman, E. (1998). Isoform sorting and the creation of intracellular compartments. Annu. Rev. Cell Dev. Biol. 14, 339–372.
Hamada, K., Fukuchi, S., Arisawa, M., Baba, M., and Kitada, K. (1998a). Screening for glycosylphosphatidylinositol (GPI)-dependent cell wall proteins in Saccharomyces cerevisiae. Mol. Gen. Genet. 258, 53–59.
Hamada, K., Terashima, H., Arisawa, M., and Kitada, K. (1998b). Amino acid sequence requirement for efficient incorporation of glycosylphosphatidylinositol-associated proteins into the cell wall of Saccharomyces cerevisiae. J. Biol. Chem. 273, 26946–26953.
Hamman, B., Hendershot, L., and Johnson, A. (1998). BiP maintains the permeability barrier of the ER membrane by sealing the lumenal end of the translocon pore before and early in translocation. Cell 92, 747–758.

Hartmann, E., Rapoport, T., and Lodish, H. (1989). Predicting the orientation of eukaryotic membrane-spanning proteins. Proc. Natl. Acad. Sci. U.S.A. 86, 5786–5790.
Hegde, R., and Lingappa, V. (1997). Membrane protein biogenesis: regulated complexity at the endoplasmic reticulum. Cell 91, 575–582.
Hicks, G., and Raikhel, N. (1995). Protein import into the nucleus: an integrated view. Annu. Rev. Cell Dev. Biol. 11, 55–88.
Holtz, D., Tanaka, R., Hartwig, J., and McKeon, F. (1989). The CaaX motif of lamin A functions in conjunction with the nuclear localization signal to target assembly to the nuclear envelope. Cell 59, 969–977.
Höning, S., Sandoval, I., and von Figura, K. (1998). A di-leucine-based motif in the cytoplasmic tail of Limp-II and tyrosinase mediates selective binding of AP-3. EMBO J. 17, 1304–1314.
Horton, P., and Nakai, K. (1996). A probabilistic classification system for predicting the cellular localization sites of proteins. Intell. Syst. Mol. Biol. 4, 109–115.
Horton, P., and Nakai, K. (1997). Better prediction of protein cellular localization sites with the k nearest neighbor classifier. Intell. Syst. Mol. Biol. 5, 147–152.
Howe, C., and Wallace, T. (1990). Prediction of leader peptide cleavage sites for polypeptides of the thylakoid lumen. Nucl. Acids Res. 18, 3417.
Howell, S., and Crine, P. (1996). Type VI membrane proteins? Trends Biochem. Sci. 21, 171–172.
Hucho, F., Görne-Tschelnokow, U., and Strecker, A. (1994). β-structure in the membrane-spanning part of the nicotinic acetylcholine receptor (or how helical are transmembrane helices?). Trends Biochem. Sci. 19, 383–387.
Ito, K. (1996). The major pathways of protein translocation across membranes. Genes Cells 1, 337–346.
Izadi-Pruneyre, N., Wolff, N., Redeker, V., Wandersman, C., Delepierre, M., and Lecroisey, A. (1999). NMR studies of the C-terminal secretion signal of the haem-binding protein HasA. Eur. J. Biochem. 261, 562–568.
Jain, R., Rusch, S., and Kendall, D. (1994).
Signal peptide cleavage regions. Functional limits on length and topological implications. J. Biol. Chem. 269, 16305–16310.
Jain, S., Drendel, W., Chen, Z.-W., Mathews, F., Sly, W., and Grubb, J. (1996). Structure of human β-glucuronidase reveals candidate lysosomal targeting and active-site motifs. Nature Struct. Biol. 3, 375–380.
Johansson, J., Persson, P., Lowenadler, B., Robertson, B., Joernvall, H., and Curstedt, T. (1991). Canine hydrophobic surfactant polypeptide SP-C. A lipopeptide with one thioester-linked palmitoyl group. FEBS Lett. 281, 119–122.
Jones, D., Taylor, W., and Thornton, J. (1994). A model recognition approach to the prediction of all-helical membrane protein structure and topology. Biochemistry 33, 3038–3049.
Keegstra, K., and Cline, K. (1999). Protein import and routing systems of chloroplasts. Plant Cell 11, 557–570.
Kiefer, D., Hu, X., Dalbey, R., and Kuhn, A. (1997). Negatively charged amino acid residues play an active role in orienting the sec-independent Pf3 coat protein in the Escherichia coli inner membrane. EMBO J. 16, 2197–2204.
Kihara, A., and Ito, K. (1998). Translocation, folding, and stability of the HflKC complex with signal anchor topogenic sequences. J. Biol. Chem. 273, 29770–29775.
Kirchhausen, T., Bonifacino, J., and Riezman, H. (1997). Linking cargo to vesicle formation: receptor tail interactions with coat proteins. Curr. Opin. Cell Biol. 9, 488–495.
Kjer-Nielsen, L., Teasdale, R., van Vliet, C., and Gleeson, P. (1999). A novel Golgi-localisation domain shared by a class of coiled-coil peripheral membrane proteins. Curr. Biol. 9, 385–388.

Klein, P., Kanehisa, M., and DeLisi, C. (1985). The detection and classification of membrane-spanning proteins. Biochim. Biophys. Acta 815, 468–476.
Klein, P., Somorjai, R., and Lau, P. (1988). Distinctive properties of signal sequences from bacterial lipoproteins. Protein Eng. 2, 15–20.
Klis, F. (1994). Review: cell wall assembly in yeast. Yeast 10, 851–869.
Kornfeld, S., and Mellman, I. (1989). The biogenesis of lysosomes. Annu. Rev. Cell Biol. 5, 483–525.
Kouranov, A., and Schnell, D. (1996). Protein translocation at the envelope and thylakoid membranes of chloroplasts. J. Biol. Chem. 271, 31009–31012.
Kunau, W. (1998). Peroxisome biogenesis: from yeast to man. Curr. Opin. Microbiol. 1, 232–237.
Kuroiwa, T., Sakaguchi, M., Mihara, K., and Omura, T. (1991). Systematic analysis of stop-transfer sequence for microsomal membrane. J. Biol. Chem. 266, 9251–9255.
Kuroiwa, T., Sakaguchi, M., Omura, T., and Mihara, K. (1996). Reinitiation of protein translocation across the endoplasmic reticulum membrane topogenesis of multispanning membrane proteins. J. Biol. Chem. 271, 6243–6248.
Kutay, U., Hartmann, E., and Rapoport, T. (1993). A class of membrane proteins with a C-terminal anchor. Trends Cell Biol. 3, 72–75.
Kuwajima, G., Kawagishi, I., Homma, M., Asaka, J.-I., Kondon, E., and Macnab, R. (1989). Export of an N-terminal fragment of Escherichia coli flagellin by a flagellum-specific pathway. Proc. Natl. Acad. Sci. U.S.A. 86, 4953–4957.
Kyte, J., and Doolittle, R. (1982). A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157, 105–132.
Ladunga, I., Czako, F., Csabai, I., and Geszti, T. (1991). Improving signal peptide prediction accuracy by simulated neural network. Comput. Appl. Biosci. 7, 485–487.
Landolt-Marticorena, C., Williams, K., Deber, C., and Reithmeier, R. (1993). Non-random distribution of amino acids in the transmembrane segments of human type I single span membrane proteins. J. Mol. Biol. 229, 602–608.
Le Borgne, R., and Hoflack, B. (1998a). Mechanisms of protein sorting and coat assembly: insights from the clathrin-coated vesicle pathway. Curr. Opin. Cell Biol. 10, 499–503.
Le Borgne, R., and Hoflack, B. (1998b). Protein transport from the secretory to the endocytic pathway in mammalian cells. Biochim. Biophys. Acta 1404, 195–209.
Li, H.-M., and Chen, L.-J. (1996). Protein targeting and integration signal for the chloroplastic outer envelope membrane. Plant Cell 8, 2117–2126.
Luciano, P., and Géli, V. (1996). The mitochondrial processing peptidase: function and specificity. Experientia 52, 1077–1082.
Ludwig, T., Le Borgne, R., and Hoflack, B. (1995). Roles for mannose-6-phosphate receptors in lysosomal enzyme sorting, IGF-II binding and clathrin-coat assembly. Trends Cell Biol. 5, 202–206.
Marsh, M., and McMahon, H. (1999). The structural era of endocytosis. Science 285, 215–220.
Martoglio, B., and Dobberstein, B. (1998). Signal sequences: more than just greasy peptides. Trends Cell Biol. 8, 410–415.
Matlack, K., Mothes, W., and Rapoport, T. (1998). Protein translocation: tunnel vision. Cell 92, 381–390.
Matoba, S., and Ogrydziak, D. (1998). Another factor besides hydrophobicity can affect signal peptide interaction with signal recognition particle. J. Biol. Chem. 273, 18841–18847.
Matsuyama, S., Tajima, T., and Tokuda, H. (1995). A novel periplasmic carrier protein involved in the sorting and transport of Escherichia coli lipoproteins destined for the outer membrane. EMBO J. 14, 3365–3372.

Matsuyama, S., Yokota, N., and Tokuda, H. (1997). A novel outer membrane lipoprotein, LolB (HemM) involved in the LolA (p20)-dependent localization of lipoproteins to the outer membrane of Escherichia coli. EMBO J. 16, 6947–6955.
Mattaj, I. W., and Conti, E. (1999). Snail mail to the nucleus. Nature 399, 208–210.
Mattaj, I. W., and Englmeier, L. (1998). Nucleocytoplasmic transport: the soluble phase. Annu. Rev. Biochem. 67, 265–306.
Matter, K., and Mellman, I. (1994). Mechanisms of cell polarity: sorting and transport in epithelial cells. Curr. Opin. Cell Biol. 6, 545–554.
Mauro, L., and Dixon, J. (1994). ‘Zip codes’ direct intracellular protein tyrosine phosphatases to the correct cellular ‘address’. Trends Biochem. Sci. 19, 151–155.
McGeoch, D. (1985). On the predictive recognition of signal peptide sequences. Virus Res. 3, 271–286.
McIlhinney, R. (1998). Membrane targeting via protein N-myristoylation. Methods Mol. Biol. 88, 211–225.
McLaughlin, S., and Aderem, A. (1995). The myristoyl-electrostatic switch: a modulator of reversible protein-membrane interactions. Trends Biochem. Sci. 20, 272–276.
McMurry, J., and Kendall, D. (1999). An artificial transmembrane segment directs SecA, SecB, and electrochemical potential-dependent translocation of a long amino-terminal tail. J. Biol. Chem. 274, 6776–6782.
McNew, J., and Goodman, J. (1996). The targeting and assembly of peroxisomal proteins: some old rules do not apply. Trends Biochem. Sci. 21, 54–58.
Mecsas, J., and Strauss, E. (1996). Molecular mechanisms of bacterial virulence: type III secretion and pathogenicity islands. Emerg. Infect. Dis. 2, 270–288.
Mellman, I. (1996). Endocytosis and molecular sorting. Annu. Rev. Cell Dev. Biol. 12, 575–625.
Metcalf, P., and Fusek, M. (1993). Two crystal structures for cathepsin D: the lysosomal targeting signal and active site. EMBO J. 12, 1293–1302.
Michiels, T., and Cornelis, G. (1991). Secretion of hybrid proteins by the Yersinia Yop export system. J. Bacteriol.
173, 1677–1685. Mihara, K., and Omura, T. (1996). Cytoplasmic chaperones in precursor targeting to mitochondria: the role of msf and hsp 70. Trends Cell Biol. 6, 104–108. Milligan, G., Parenti, M., and Magee, M. (1995). The dynamic role of palmitoylation in signal transduction. Trends Biochem. Sci. 20, 181–186. Miyazawa, A., Fujiyoshi, Y., Stowell, M., and Unwin, N. (1999). Nicotinic acetycholine receptor at 4.6 A˚ resolution: transverse tunnels in the channel wall. J. Mol. Biol. 288, 765–786. Mothes, W., Heinrich S., Graf, R., Nilsson, I., von Heijne, G., Brunner, J., and Rapoport, T. (1997). Molecular mechanism of membrane protein integration into the endoplasmic reticulum. Cell 89, 523–533. Muller, G., and Zimmermann, R. (1987). Import of honeybee prepromelittin into the endoplasmic reticulum: structural basis for independence of SRP and docking protein. EMBO J. 6, 2099–2107. Munro, S. (1998). Localization of porteins to the Golgi apparatus. Trends Cell Biol. 8, 11–15. Munro, S., and Nichols, B. (1999). The grip domain—a novel Golgi-targeting domain found in several coiled-coil proteins. Curr. Biol. 9, 377–380. Nagarajan, V. (1993). Protein secretion. In Sonenshein, A. Hoch, J., and Losick, R. (eds.), ‘‘Bacillus subtilis and Other Gram-Positive Bacteria: Biochemistry, Physiology, and Molecular Genetics.’’ American Society of Microbiology, Washington, D.C. Nakahara, D., Lingappa, V., and Chuck, S. (1994). Translocational pausing is a common step in the biogenesis of unconventional integral membrane and secretory proteins. J. Biol. Chem. 269, 7617–7622.

Nakai, K. (1991). Predicting various targeting signals in amino acid sequences. Bull. Inst. Chem. Res., Kyoto Univ. 69, 269–291.
Nakai, K. (1996). Refinement of the prediction methods of signal peptides for the genome analyses of Saccharomyces cerevisiae and Bacillus subtilis. In Akutsu, T., Asai, K., Hagiya, M., Kuhara, S., Miyano, S., and Nakai, K. (eds.), “Genome Informatics 1996.” Universal Academy Press, Inc., Tokyo, Japan, 72–81.
Nakai, K., and Horton, P. (1999). PSORT: a program for detecting the sorting signals of proteins and predicting their subcellular localization. Trends Biochem. Sci. 24, 34–35.
Nakai, K., and Kanehisa, M. (1991). Expert system for predicting protein localization sites in gram-negative bacteria. Proteins: Struct. Funct. Genet. 11, 95–110.
Nakai, K., and Kanehisa, M. (1992). A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics 14, 897–911.
Nakashima, H., and Nishikawa, K. (1994). Discrimination of intracellular and extracellular proteins using amino acid composition and residue-pair frequencies. J. Mol. Biol. 238, 54–61.
Navarre, W., and Schneewind, O. (1999). Surface proteins of gram-positive bacteria and mechanisms of their targeting to the cell wall envelope. Microbiol. Mol. Biol. Rev. 63, 174–229.
Nelson, W. (1992). Regulation of cell surface polarity from bacteria to mammals. Science 258, 948–955.
Neupert, W. (1997). Protein import into mitochondria. Annu. Rev. Biochem. 66, 863–917.
Ng, T., Brown, J., and Walter, P. (1996). Signal sequences specify the targeting route to the endoplasmic reticulum membrane. J. Cell Biol. 134, 269–278.
Nielsen, H., Brunak, S., and von Heijne, G. (1999). Machine learning approaches for the prediction of signal peptides and other protein sorting signals. Protein Eng. 12, 3–9.
Nielsen, H., Engelbrecht, J., Brunak, S., and von Heijne, G. (1997). Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng. 10, 1–6.
Nielsen, H., and Krogh, A. (1998). Prediction of signal peptides and signal anchors by a hidden Markov model. Intell. Syst. Mol. Biol. 6, 122–130.
Nilsson, I., Whitley, P., and von Heijne, G. (1994). The COOH-terminal ends of internal signal and signal-anchor sequences are positioned differently in the ER translocase. J. Cell Biol. 126, 1127–1132.
Nishimura, N., and Balch, W. (1997). A di-acidic signal required for selective export from the endoplasmic reticulum. Science 277, 556–558.
Nunn, D. (1999). Bacterial type II protein export and pilus biogenesis: more than just homologies? Trends Cell Biol. 9, 402–408.
Ohno, M., Fornerod, M., and Mattaj, I. (1998). Nucleocytoplasmic transport: the last 200 nanometers. Cell 92, 327–336.
Olsen, L. (1998). The surprising complexity of peroxisome biogenesis. Plant Mol. Biol. 38, 163–189.
Ota, K., Sakaguchi, M., Hamasaki, N., and Mihara, K. (1998a). Assessment of topogenic functions of anticipated transmembrane segments of human band 3. J. Biol. Chem. 273, 28286–28291.
Ota, K., Sakaguchi, M., von Heijne, G., Hamasaki, N., and Mihara, K. (1998b). Forced transmembrane orientation of hydrophilic polypeptide segments in multispanning membrane proteins. Mol. Cell 2, 495–503.
Overmeyer, J., Erdman, R., and Maltese, W. (1998). Membrane targeting via protein prenylation. Methods Mol. Biol. 88, 249–263.

Paetzel, M., Dalbey, R., and Strynadka, N. (1998). Crystal structure of a bacterial signal peptidase in complex with a β-lactam inhibitor. Nature 396, 186–190.
Pasquier, C., Promponas, V., Palaios, G., Hamodrakas, J., and Hamodrakas, S. (1999). A novel method for predicting transmembrane segments in proteins based on a statistical analysis of the SwissProt database: the PRED-TMR algorithm. Protein Eng. 12, 381–385.
Persson, B., and Argos, P. (1994). Prediction of transmembrane segments in proteins utilising multiple sequence alignments. J. Mol. Biol. 237, 182–192.
Pfanner, N. (1998). Mitochondrial import: crossing the aqueous intermembrane space. Curr. Biol. 8, R262–R265.
Pfanner, N., Craig, E., and Honlinger, A. (1997). Mitochondrial preprotein translocase. Annu. Rev. Cell Dev. Biol. 13, 25–51.
Piguet, V., Gu, F., Foti, M., Demaurex, N., Gruenberg, J., Carpentier, J.-L., and Trono, D. (1999). Nef-induced CD4 degradation: a diacidic-based motif in Nef functions as a lysosomal targeting signal through the binding of β-COP in endosomes. Cell 97, 63–73.
Pilon, M., and Schekman, R. (1999). Protein translocation: how hsp70 pulls it off. Cell 97, 679–682.
Pines, J., and Hunter, T. (1994). The differential localization of human cyclins A and B is due to a cytoplasmic retention signal in cyclin B. EMBO J. 13, 3772–3781.
Pohlschröder, M., Prinz, W., Hartmann, E., and Beckwith, J. (1997). Protein translocation in the three domains of life: variations on a theme. Cell 91, 563–566.
Prange, R., and Streeck, R. (1995). Novel transmembrane topology of the hepatitis B virus envelope proteins. EMBO J. 14, 247–256.
Prinz, W., Boyd, D., Ehrmann, M., and Beckwith, J. (1998). The protein translocation apparatus contributes to determining the topology of an integral membrane protein in Escherichia coli. J. Biol. Chem. 273, 8419–8424.
Prinz, W., Spiess, C., Ehrmann, M., Schierle, C., and Beckwith, J. (1996). Targeting of signal sequenceless proteins for export in Escherichia coli with altered protein translocase. EMBO J. 15, 5209–5217.
Pugsley, A. (1993). The complete general secretory pathway in gram-negative bacteria. Microbiol. Rev. 57, 50–108.
Qi, H.-Y., and Bernstein, H. (1999). SecA is required for the insertion of inner membrane proteins targeted by the Escherichia coli signal recognition particle. J. Biol. Chem. 274, 8993–8997.
Reinhardt, A., and Hubbard, T. (1998). Using neural networks for prediction of the subcellular location of proteins. Nucl. Acids Res. 26, 2230–2236.
Resh, M. (1994). Myristylation and palmitylation of src family members: the fats of the matter. Cell 76, 411–413.
Richter, S., and Lamppa, G. (1998). A chloroplast processing enzyme functions as the general stromal processing peptidase. Proc. Natl. Acad. Sci. U.S.A. 95, 7463–7468.
Robbins, J., Dilworth, S., Laskey, R., and Dingwall, C. (1991). Two interdependent basic domains in nucleoplasmin nuclear targeting sequence: identification of a class of bipartite nuclear targeting sequences. Cell 64, 615–623.
Robinson, C., Hynds, P., Robinson, D., and Mant, A. (1998). Multiple pathways for the targeting of thylakoid proteins in chloroplasts. Plant Mol. Biol. 38, 209–221.
Rodrigue, A., Chanal, A., Beck, K., Müller, M., and Wu, L.-F. (1999). Co-translocation of a periplasmic enzyme complex by a hitchhiker mechanism through the bacterial TAT pathway. J. Biol. Chem. 274, 13223–13228.

Rodriguez-Boulan, E., and Gonzalez, A. (1999). Glycans in post-Golgi apical targeting: sorting signals or structural props? Trends Cell Biol. 9, 291–294.
Rodríguez-Concepción, M., Yalovsky, S., and Gruissem, W. (1999a). Protein prenylation in plants: old friends and new targets. Plant Mol. Biol. 39, 865–870.
Rodríguez-Concepción, M., Yalovsky, S., Zik, M., Fromm, H., and Gruissem, W. (1999b). The prenylation status of a novel plant calmodulin directs plasma membrane or nuclear localization of the protein. EMBO J. 18, 1996–2007.
Roise, D. (1997). Recognition and binding of mitochondrial presequences during the import of proteins into mitochondria. J. Bioenerg. Biomembr. 29, 19–27.
Rost, B., Casadio, R., and Fariselli, P. (1996). Refining neural network predictions for helical transmembrane proteins by dynamic programming. Intell. Syst. Mol. Biol. 4, 192–200.
Rost, B., Casadio, R., Fariselli, P., and Sander, C. (1995). Transmembrane helices predicted at 95% accuracy. Protein Sci. 4, 521–533.
Rothman, J., and Wieland, F. (1996). Protein sorting by transport vesicles. Science 272, 227–234.
Sakaguchi, M. (1997). Eukaryotic protein secretion. Curr. Opin. Biotechnol. 8, 595–601.
Sakaguchi, M., Tomiyoshi, R., Kuroiwa, T., Mihara, K., and Omura, T. (1992). Functions of signal and signal-anchor sequences are determined by the balance between the hydrophobic segment. Proc. Natl. Acad. Sci. U.S.A. 89, 16–19.
Salmond, G., and Reeves, P. (1993). Membrane traffic wardens and protein secretion in gram-negative bacteria. Trends Biochem. Sci. 18, 7–12.
Santini, C.-L., Ize, B., Chanal, A., Müller, M., Giordano, G., and Wu, L.-F. (1998). A novel sec-independent periplasmic protein translocation pathway in Escherichia coli. EMBO J. 17, 101–112.
Sargent, F., Bogsch, E., Stanley, N., Wexler, M., Robinson, C., Berks, B., and Palmer, T. (1998). Overlapping functions of components of a bacterial sec-independent protein export pathway. EMBO J. 17, 3640–3650.
Sato, M., Sato, K., and Nakano, A. (1996). Endoplasmic reticulum localization of Sec12p is achieved by two mechanisms: Rer1p-dependent retrieval that requires the transmembrane domain and Rer1p-independent retention that involves the cytoplasmic domain. J. Cell Biol. 134, 279–293.
Sato, K., Sato, M., and Nakano, A. (1997). Rer1p as common machinery for the endoplasmic reticulum localization of membrane proteins. Proc. Natl. Acad. Sci. U.S.A. 94, 9693–9698.
Schatz, G. (1996). The protein import system of mitochondria. J. Biol. Chem. 271, 31763–31766.
Schatz, G., and Dobberstein, B. (1996). Common principles of protein translocation across membranes. Science 271, 1519–1526.
Schekman, R. (1994). Translocation gets a push. Cell 78, 911–913.
Schekman, R. (1996). Coat proteins and vesicle budding. Science 271, 1526–1533.
Schirmer, T., and Cowan, S. (1993). Prediction of membrane-spanning β-strands and its application to maltoporin. Protein Sci. 2, 1361–1363.
Schmid, S. (1997). Clathrin-coated vesicle formation and protein sorting: an integrated process. Annu. Rev. Biochem. 66, 511–548.
Schneewind, O., Mihaylova-Petkov, D., and Model, P. (1993). Cell wall sorting signals in surface proteins of gram-positive bacteria. EMBO J. 12, 4803–4811.
Schneewind, O., Model, P., and Fischetti, V. (1992). Sorting of protein A to the staphylococcal cell wall. Cell 70, 267–281.

Schneider, G., Sjöling, S., Wallin, E., Wrede, P., Glaser, E., and von Heijne, G. (1998). Feature-extraction from endopeptidase cleavage sites in mitochondrial targeting peptides. Proteins: Struct. Funct. Genet. 30, 49–60.
Schneider, G., and Wrede, P. (1993). Signal analysis of protein targeting sequences. Protein Sequence Data Analysis 5, 227–236.
Schnell, D. (1998). Protein targeting to the thylakoid membrane. Annu. Rev. Plant Physiol. Plant Mol. Biol. 49, 97–126.
Schweizer, A., Kornfeld, S., and Rohrer, J. (1997). Proper sorting of the cation-dependent mannose 6-phosphate receptor in endosomes depends on a pair of aromatic amino acids in the cytoplasmic tail. Proc. Natl. Acad. Sci. U.S.A. 94, 14471–14476.
Seaman, M., McCaffery, J., and Emr, S. (1998). A membrane coat complex essential for endosome-to-Golgi retrograde transport in yeast. J. Cell Biol. 142, 665–681.
Seligman, L., and Manoil, C. (1994). An amphipathic sequence determinant of membrane protein topology. J. Biol. Chem. 269, 19888–19896.
Settles, A., and Martienssen, R. (1998). Old and new pathways of protein export in chloroplasts and bacteria. Trends Cell Biol. 8, 494–501.
Shapiro, L. (1993). Protein localization and asymmetry in the bacterial cell. Cell 73, 841–855.
Siegel, V. (1995). A second signal recognition event required for translocation into the endoplasmic reticulum. Cell 82, 167–170.
Siegel, V. (1997). Recognition of a transmembrane domain: another role for the ribosome? Cell 90, 5–8.
Simonen, M., and Palva, I. (1993). Protein secretion in Bacillus species. Microbiol. Rev. 57, 109–137.
Singer, S. (1990). The structure and insertion of integral proteins in membranes. Annu. Rev. Cell Biol. 6, 247–296.
Sipos, L., and von Heijne, G. (1993). Predicting the topology of eukaryotic membrane proteins. Eur. J. Biochem. 213, 1333–1340.
Sjöström, M., Wold, S., Wieslander, A., and Rilfors, L. (1987). Signal peptide amino acid sequences in Escherichia coli contain information related to final protein localization. A multivariate data analysis. EMBO J. 6, 823–831.
Smith, H., and Raikhel, N. (1999). Protein targeting to the nuclear pore. What can we learn from plants? Plant Physiol. 119, 1157–1163.
Song, M.-C., Shimokata, K., Kitada, S., Ogishima, T., and Ito, A. (1996). Role of basic amino acids in the cleavage of synthetic peptide substrate by mitochondrial processing peptidase. J. Biochem. (Tokyo) 120, 1163–1166.
Sonnhammer, E., von Heijne, G., and Krogh, A. (1998). A hidden Markov model for predicting transmembrane helices in protein sequences. Intell. Syst. Mol. Biol. 6, 175–182.
Spiess, M. (1995). Heads or tails - what determines the orientation of proteins in the membrane. FEBS Lett. 369, 76–79.
Staden, R. (1999). Finding protein coding regions in genomic sequences. In Doolittle, R. (ed.), “Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences, Methods in Enzymology,” vol. 183. Academic Press, San Diego.
Steenaart, N., and Shore, G. (1997). Alteration of a mitochondrial outer membrane signal anchor sequence that permits its insertion into the inner membrane: contribution of hydrophobic residues. J. Biol. Chem. 272, 12057–12061.
Strittmatter, S., Valenzuela, D., Kennedy, T., Neer, E., and Fishman, M. (1990). G0 is a major growth cone protein subject to regulation by Gap-43. Nature 344, 836–841.

Struyvé, M., Moons, M., and Tommassen, J. (1991). Carboxy-terminal phenylalanine is essential for the correct assembly of a bacterial outer membrane protein. J. Mol. Biol. 218, 141–148.
Stuart, R., and Neupert, W. (1996). Topogenesis of inner membrane proteins of mitochondria. Trends Biochem. Sci. 21, 261–267.
Subramani, S. (1998). Components involved in peroxisome import, biogenesis, proliferation, turnover, and movement. Physiol. Rev. 78, 171–188.
Takeda, J., and Kinoshita, T. (1995). GPI-anchor biosynthesis. Trends Biochem. Sci. 20, 367–371.
Talcott, B., and Moore, M. (1999). Getting across the nuclear pore complex. Trends Cell Biol. 9, 312–318.
Tan, P., Howard, J., and Payne, G. (1996). The sequence NPFXD defines a new class of endocytosis signal in Saccharomyces cerevisiae. J. Cell Biol. 135, 1789–1800.
Taylor, W., Jones, D., and Green, N. (1994). A method for α-helical integral membrane protein fold prediction. Proteins: Struct. Funct. Genet. 18, 281–294.
Teasdale, R., and Jackson, M. (1996). Signal-mediated sorting of membrane proteins between the endoplasmic reticulum and the Golgi apparatus. Annu. Rev. Cell Dev. Biol. 12, 27–54.
Thanassi, D. G., Saulino, E. T., and Hultgren, S. J. (1998). The chaperone/usher pathway: a major terminal branch of the general secretory pathway. Curr. Opin. Microbiol. 1, 223–231.
Tjalsma, H., Kontinen, V., Prágai, Z., Wu, H., Meima, R., Venema, G., Bron, S., Sarvas, M., and van Dijl, J. (1999). The role of lipoprotein processing by signal peptidase II in the gram-positive eubacterium Bacillus subtilis. J. Biol. Chem. 274, 1698–1707.
Turner, R., and Weiner, J. (1993). Evaluation of transmembrane helix prediction methods using the recently defined NMR structures of the coat proteins from bacteriophages M13 and Pf1. Biochim. Biophys. Acta 1202, 161–168.
Tusnády, G., and Simon, I. (1998). Principles governing amino acid composition of integral membrane proteins: application to topology prediction. J. Mol. Biol. 283, 489–506.
Udenfriend, S., and Kodukula, K. (1995a). How glycosylphosphatidylinositol-anchored membrane proteins are made. Annu. Rev. Biochem. 64, 563–591.
Udenfriend, S., and Kodukula, K. (1995b). Prediction of omega site in nascent precursor of glycosylphosphatidylinositol protein. Methods Enzymol. 250, 571–582.
Ulbrandt, N., Newitt, J., and Bernstein, H. (1997). The E. coli signal recognition particle is required for the insertion of a subset of inner membrane proteins. Cell 88, 187–196.
Vallee, R., and Sheetz, M. (1996). Targeting of motor proteins. Science 271, 1539–1544.
van Klompenburg, W., Nilsson, I., von Heijne, G., and de Kruijff, B. (1997). Anionic phospholipids are determinants of membrane protein topology. EMBO J. 14, 4261–4266.
Vandromme, M., Gauthier-Rouvière, C., Lamb, N., and Fernandez, A. (1996). Regulation of transcription factor localization: fine-tuning of gene expression. Trends Biochem. Sci. 21, 59–64.
Vijayasaradhi, S., Xu, Y., Bouchard, B., and Houghton, A. (1995). Intracellular sorting and targeting of melanosomal membrane proteins: identification of signals for sorting of the human brown locus protein, gp75. J. Cell Biol. 130, 807–820.
von Heijne, G. (1984). How signal sequences maintain cleavage specificity. J. Mol. Biol. 173, 243–251.
von Heijne, G. (1985). Signal sequences. The limits of variation. J. Mol. Biol. 184, 99–105.
von Heijne, G. (1986a). The distribution of positively charged residues in bacterial inner membrane proteins correlates with the trans-membrane topology. EMBO J. 5, 3021–3027.

von Heijne, G. (1986b). A new method for predicting signal sequence cleavage sites. Nucl. Acids Res. 14, 4683–4690.
von Heijne, G. (1988). Transcending the impenetrable: how proteins come to terms with membranes. Biochim. Biophys. Acta 947, 307–333.
von Heijne, G. (1989). The structure of signal peptides from bacterial lipoproteins. Protein Eng. 2, 531–534.
von Heijne, G. (1992). Membrane protein structure prediction: hydrophobicity analysis and the positive-inside rule. J. Mol. Biol. 225, 487–494.
von Heijne, G. (1994). Membrane proteins: from sequence to structure. Annu. Rev. Biophys. Biomol. Struct. 23, 167–192.
von Heijne, G. (1995). Membrane protein assembly: rules of the game. Bioessays 17, 25–30.
von Heijne, G. (1997). Getting greasy: how transmembrane polypeptide segments integrate into the lipid bilayer. Mol. Microbiol. 24, 249–253.
von Heijne, G. (1998). Life and death of a signal peptide. Nature 396, 111–113.
von Heijne, G., and Gavel, Y. (1988). Topogenic signals in integral membrane proteins. Eur. J. Biochem. 174, 671–678.
von Heijne, G., Steppuhn, J., and Herrmann, R. (1989). Domain structure of mitochondrial and chloroplast targeting peptides. Eur. J. Biochem. 180, 535–545.
Wahlberg, J., and Spiess, M. (1997). Multiple determinants direct the orientation of signal-anchor proteins: the topogenic role of the hydrophobic signal domain. J. Cell Biol. 137, 555–562.
Weis, K. (1998). Importins and exportins: how to get in and out of the nucleus. Trends Biochem. Sci. 23, 185–189.
Walter, P. (1992). Travelling by tram. Nature 357, 22–23.
Weiner, J., Bilous, P., Shaw, G., Lubitz, S., Frost, L., Thomas, G., Cole, J., and Turner, R. (1998). A novel and ubiquitous system for membrane targeting and secretion of cofactor-containing proteins. Cell 93, 93–101.
Wozniak, R., Rout, M., and Aitchison, J. (1998). Karyopherins and kissing cousins. Trends Cell Biol. 8, 184–188.
Yamaguchi, K., Yu, F., and Inouye, M. (1988). A single amino acid determinant of the membrane localization of lipoproteins in E. coli. Cell 53, 423–432.
Yost, C., Lopez, C., Prusiner, S., Myers, R., and Lingappa, V. (1990). Non-hydrophobic extracytoplasmic determinant of stop transfer in the prion protein. Nature 343, 669–672.
Young, G. M., Schmiel, D. H., and Miller, V. L. (1999). A new pathway for the secretion of virulence factors by bacteria: the flagellar export apparatus functions as a protein-secretion system. Proc. Natl. Acad. Sci. U.S.A. 96, 6456–6461.
Zheng, N., and Gierasch, L. (1996). Signal sequences: the same yet different. Cell 86, 849–852.