Twenty-five years of nomenclature and classification of proteolytic enzymes

Journal Pre-proof

Neil D. Rawlings

PII: S1570-9639(19)30238-9
DOI: https://doi.org/10.1016/j.bbapap.2019.140345
Reference: BBAPAP 140345
To appear in: BBA - Proteins and Proteomics
Received date: 4 November 2019
Revised date: 9 December 2019
Accepted date: 11 December 2019

Please cite this article as: N.D. Rawlings, Twenty-five years of nomenclature and classification of proteolytic enzymes, BBA - Proteins and Proteomics (2019), https://doi.org/10.1016/j.bbapap.2019.140345

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2019 Published by Elsevier.

Neil D. Rawlings

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK

E-mail: [email protected]

Abstract

Proteolytic enzymes and their homologues have been classified into clans by comparing the tertiary structures of the peptidase domains, into families by comparing the protein sequences of the peptidase domains, and into protein-species by comparing various attributes including domain architecture, substrate preference, inhibitor interactions, subcellular location, and phylogeny. The results are compared with the earlier classification by Rawlings & Barrett (1993). The numbers of sequences, protein-species, families, clans and even catalytic types have increased substantially during the intervening 26 years. The alternative classifications by catalytic type and/or activity are shown not to reflect evolutionary relationships.

Introduction

It has been over a quarter of a century since Alan Barrett and I published the first classification of proteolytic enzymes based on structural and sequence relationships that we believed mirrored evolutionary relationships (Rawlings & Barrett, 1993). This classification has been maintained in the intervening period via the MEROPS website and database (https://www.ebi.ac.uk/merops/; Rawlings et al., 2018). The classification groups related structures into a clan and homologous sequences into a family, which provides a hierarchical classification system in which a clan contains one or more families. In much the same way that a type specimen exists for every species of organism, we have assigned a representative tertiary structure deposited in the PDB database (Burley et al., 2019) to be a type example for each clan. To be included in the same clan, a tertiary structure must be significantly similar to that of the type example of that clan. One well-characterized sequence is chosen to represent the type example for each family. To belong to the same family as a type example, a sequence must be homologous to that of the type example and the positions of the catalytic residues must be preserved.

Two other, optional, classification levels were also included. A clan could be divided into subclans if, in one or more families, the active site residues (or the metal ligands of a metallopeptidase) were different from those of the type example. A family in which there were ancient and discrete lineages could be divided into subfamilies. A lower level was introduced with the publication of the first edition of the Handbook of Proteolytic Enzymes (Barrett et al., 1998). This lowest level recognized that a single family included peptidase activities that could be distinguished from one another, which we termed “protein-species”. The criteria we used for determining what constituted a “peptidase-species” were outlined in Barrett & Rawlings, 2007 (see Table 1). We have assigned one well-characterized sequence to represent the “holotype” for each peptidase-species. The task is to identify sequences from different organisms that represent the same proteolytic activity, even though in the majority of cases there is no experimental characterization except for the holotype. Thus, the hierarchy in MEROPS can be up to five levels, in descending order: clan, subclan, family, subfamily, and peptidase-species.

Table 1 near here

Each level in the hierarchy is given a unique identifier. The first character in each identifier indicates the catalytic type: “A” for aspartic, “C” for cysteine, “G” for glutamic, “M” for metallo, “S” for serine and “T” for threonine, plus “N” for asparagine peptide lyases, “U” for unknown catalytic type and “P” for mixed catalytic type (because all peptidases involved have protein nucleophiles). A clan name consists of two letters, with the second being assigned alphabetically. A subclan name consists of the clan name with a third letter added in parentheses, either indicating the catalytic type for peptidases with protein nucleophiles or indicating the nature of a metal ligand in metallopeptidases (e.g. “E” for a Glu-zincin and “M” for a Met-zincin). A family name consists of a letter to indicate the catalytic type followed by a serial number. A subfamily has a letter added to the family name, which is assigned alphabetically. A peptidase-species identifier (also known as a MEROPS identifier) consists of the family name padded to three characters with zeroes, followed by a decimal point and then a serial number. Special MEROPS identifiers are created for non-peptidase homologues (homologues which are not active as peptidases), in which the first character after the decimal point is “9”; for pseudogenes, in which the first character after the decimal point is “P”; and for uncharacterized homologues from model organisms, in which the first character after the decimal point is “A”, “B” or “C”.

Holotypes from model organisms were added to MEROPS at different times. Holotypes from human and mouse date from the first edition of the Handbook of Proteolytic Enzymes (Barrett et al., 1998), and holotypes from the other organisms were added as follows: the plant Arabidopsis thaliana (Dec 2008); the fruit fly Drosophila melanogaster, the nematode Caenorhabditis elegans and the Gram-negative bacterium Escherichia coli (Jul 2011); the yeasts Saccharomyces cerevisiae and Schizosaccharomyces pombe, the slime mould Dictyostelium discoideum, the Gram-positive bacterium Bacillus subtilis and the archaean Pyrococcus furiosus (Dec 2012); the malaria parasite Plasmodium falciparum (Jul 2014); and zebrafish (Danio rerio; Jul 2014).

As examples of the nomenclature, the full classification of chymotrypsin A is: clan PA, subclan PA(S), family S1, subfamily S1A, MEROPS identifier S01.001; that of matrix metallopeptidase 1 is: clan MA, subclan MA(M), family M10, subfamily M10A, MEROPS identifier M10.001; that of calpamodulin (a non-peptidase homologue of calpain) is: clan CA, family C2, MEROPS identifier C02.971; and that of the product of the At2g28010 gene from A. thaliana is: clan AA, family A1, subfamily A1B, MEROPS identifier A01.A18.
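The identifier rules above lend themselves to a small parser. The following is a minimal sketch, not part of MEROPS itself: the function name `parse_merops_id`, the `MeropsId` record and the regular expression are illustrative assumptions. In particular, the pattern assumes that the part after the decimal point is always three characters, as in the examples quoted above, and that family numbers have at most three digits.

```python
import re
from dataclasses import dataclass

# Catalytic-type letters as listed in the text.
CATALYTIC_TYPES = {
    "A": "aspartic", "C": "cysteine", "G": "glutamic", "M": "metallo",
    "S": "serine", "T": "threonine", "N": "asparagine peptide lyase",
    "U": "unknown", "P": "mixed",
}

# Assumption: one type letter, a family number zero-padded to at least two
# digits, a decimal point, then three characters (digits or capitals).
_ID_PATTERN = re.compile(r"^([ACGMNPSTU])(\d{2,3})\.([0-9A-Z]{3})$")


@dataclass(frozen=True)
class MeropsId:
    catalytic_type: str  # e.g. "serine"
    family: str          # family name without zero padding, e.g. "S1"
    serial: str          # the characters after the decimal point
    kind: str            # what sort of entry the identifier denotes


def parse_merops_id(identifier: str) -> MeropsId:
    """Split a MEROPS identifier such as 'S01.001' into its parts."""
    match = _ID_PATTERN.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a recognised identifier: {identifier!r}")
    letter, number, serial = match.groups()
    first = serial[0]
    if first == "9":
        kind = "non-peptidase homologue"
    elif first == "P":
        kind = "pseudogene"
    elif first in "ABC":
        kind = "uncharacterized homologue"
    else:
        kind = "peptidase"
    family = f"{letter}{int(number)}"  # strip the zero padding
    return MeropsId(CATALYTIC_TYPES[letter], family, serial, kind)
```

Applied to the four examples in the text, the sketch reports S01.001 and M10.001 as peptidases in families S1 and M10, C02.971 as a non-peptidase homologue in family C2, and A01.A18 as an uncharacterized homologue in family A1.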

The term “peptidase unit” was defined to refer to the domain which bears the catalytic machinery. This often consists of two subdomains with the active site residues (and metal ligands of a metallopeptidase) and a substrate-binding groove between the subdomains. In addition to the catalytic residues, the specificity of a peptidase is defined by residues that line this groove. These are not necessarily conserved in a family of peptidases, so that peptidases within that family can have different specificities. The Schechter-Berger nomenclature (Schechter & Berger, 1967) of the substrate binding sites and the residues of the substrate that occupy them is shown in Fig. 1. Some peptidases have an additional substrate-binding site, known as an exosite, which further restricts specificity and can be some distance from the active site. Some peptidases have additional domains that either restrict access to the active site or bind specific regions of the substrate, further limiting the specificity of the peptidase. Some proteins contain more than one peptidase unit. There is a MEROPS identifier for each peptidase unit, and a special identifier for the entire protein (a “compound peptidase”), in which the family name part of the identifier is preceded by an “X” (an example is metallocarboxypeptidase D, which has two functional peptidase units and the MEROPS identifier XM14.001). A similar special identifier exists when the functional enzyme consists of several subunits, two or more of which are also peptidases (a “complex peptidase”); an example is the 20S proteasome, XT01.001.

There are many peptidases described in the literature for which the sequence is either unknown or only a fragment is known that is too short to use for sequence searching. Such peptidases are included in MEROPS as “unsequenced”. Each is given a unique identifier that is similar to that of a sequenced peptidase except that the three characters before the decimal point are as follows: the first character indicates the catalytic type, the second character is the number nine, and the third character indicates the activity (A = aminopeptidase; B = dipeptidase; C = dipeptidyl-peptidase; D = peptidyl-dipeptidase; E = carboxypeptidase; F = omega peptidase; G = endopeptidase).
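The two special identifier forms described above (the “X” prefix for compound and complex peptidases, and the “9” plus activity letter for unsequenced peptidases) can be decoded along the following lines. This is a hedged sketch: the function name `describe_special_id` and the returned dictionary layout are assumptions made for illustration, and the identifier `M9G.001` used in the usage note is only a format-conforming example, not a claim about a specific MEROPS entry.

```python
# Activity letters for unsequenced peptidases, as listed in the text.
ACTIVITY_CODES = {
    "A": "aminopeptidase", "B": "dipeptidase", "C": "dipeptidyl-peptidase",
    "D": "peptidyl-dipeptidase", "E": "carboxypeptidase",
    "F": "omega peptidase", "G": "endopeptidase",
}

# Catalytic-type letters, as listed earlier in the text.
CATALYTIC_LETTERS = {
    "A": "aspartic", "C": "cysteine", "G": "glutamic", "M": "metallo",
    "S": "serine", "T": "threonine", "N": "asparagine peptide lyase",
    "U": "unknown", "P": "mixed",
}


def describe_special_id(identifier: str) -> dict:
    """Recognise 'X'-prefixed and unsequenced identifiers (illustrative)."""
    prefix, dot, serial = identifier.strip().partition(".")
    if not dot or not prefix or not serial:
        raise ValueError(f"expected '<prefix>.<serial>': {identifier!r}")
    if prefix.startswith("X"):
        # Compound or complex peptidase: 'X' precedes the family name part.
        return {"kind": "compound or complex peptidase",
                "family_part": prefix[1:], "serial": serial}
    if len(prefix) == 3 and prefix[1] == "9" and prefix[2] in ACTIVITY_CODES:
        # Unsequenced peptidase: type letter, '9', then an activity letter.
        return {"kind": "unsequenced",
                "catalytic_type": CATALYTIC_LETTERS.get(prefix[0], "unrecognised"),
                "activity": ACTIVITY_CODES[prefix[2]],
                "serial": serial}
    return {"kind": "other", "family_part": prefix, "serial": serial}
```

For instance, `describe_special_id("XM14.001")` reports a compound or complex peptidase with family part `M14`, while an identifier of the form `M9G.001` would be read as an unsequenced metallo-type endopeptidase.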

Prior to the existence of the MEROPS classification, peptidases were grouped either according to very broad specificities, or according to the nature of the nucleophile in the catalytic reaction (known as the “catalytic type”). In Enzyme Nomenclature, a hybrid of the two systems is used for the classification of peptidases (NC-IUBMB, 1992). None of these systems represents the evolutionary relationships between peptidase sequences, because a family of peptidases may include both endopeptidases and exopeptidases (such as family C1, which contains homologues of the endopeptidase papain including the exopeptidase dipeptidyl-peptidase I (Ishido et al., 1991), and family S8, which contains homologues of the endopeptidase subtilisin including tripeptidyl-peptidases (Tomkinson et al., 1987)) and a clan may include peptidases of different catalytic types (such as clan PA, which includes serine peptidases such as chymotrypsin and cysteine peptidases such as picornain 3C from polio virus (Allaire et al., 1994), and clan PB, which includes Ntn-hydrolases that have threonine, serine or cysteine as a nucleophile (Oinonen & Rouvinen, 2000)). Chymotrypsin, a mammalian digestive enzyme from the intestine, and subtilisin, an enzyme from a bacterium, have similar activities and specificities as endopeptidases, but from their amino acid sequences it was clear that the proteins were unrelated (Smith et al., 1966). From the resolution of their crystal structures, it was also clear that the tertiary structures were unrelated, but the active sites of the two peptidases were superimposable, although one was the mirror image of the other (Alden et al., 1970). This was evidence of convergent evolution, but a different form of convergence from that which is observed between organisms, because there is no known common ancestor for the proteins.
In MEROPS, catalytic type is used in the nomenclature of clans, families and MEROPS identifiers, but because a clan (and two families) can contain a mixture of catalytic types, it does not form the highest level of classification. The introduction of the lowest level in the MEROPS classification enabled us to incorporate the nomenclature style of Enzyme Nomenclature for the naming of individual peptidases.

Peptidases are usually divided into exopeptidases and endopeptidases. An exopeptidase cleaves only one of the three N- or C-terminal bonds in a peptide or protein, whereas an endopeptidase can cleave a bond much further from either terminus. An endopeptidase may also cleave a bond near an N or C terminus, but it does not require a free N or C terminus, unlike an exopeptidase. By default, a peptidase is an endopeptidase if it cleaves a bond distant from the peptide termini, regardless of whether it also cleaves a bond near either terminus. Endopeptidases can only be further categorized by reference to their catalytic type. There are, however, endopeptidases that are unable to cleave proteins and can only cleave short peptides. These are known as “oligopeptidases”. The term “oligopeptidase” was originally applied to an Escherichia coli enzyme that cleaved di- and trilysine (Simmonds et al., 1976) and is now known to be oligopeptidase A. Exopeptidases are categorized according to the number of amino acids released from a peptide or protein and whether this is at the N or C terminus. An aminopeptidase (Enzyme Nomenclature sub-subclass EC 3.4.11) releases an amino acid from the N terminus, and a carboxypeptidase (EC 3.4.16, EC 3.4.17 and EC 3.4.18) releases an amino acid from the C terminus. A dipeptidase (EC 3.4.13) cleaves only a dipeptide. An exopeptidase that releases a dipeptide from the N terminus of a peptide or protein is known as a “dipeptidyl-peptidase” (also erroneously known as a “dipeptidylaminopeptidase” in the literature), and one that releases a dipeptide from the C terminus of a substrate is known as a “peptidyl-dipeptidase” (EC 3.4.15). A tripeptidyl-peptidase releases a tripeptide from the N terminus of a substrate. Dipeptidyl- and tripeptidyl-peptidases are both included in Enzyme Nomenclature sub-subclass EC 3.4.14. These activities are shown in Fig. 1.

Fig. 1 near here
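The exopeptidase categories defined above amount to a lookup on two properties: the terminus attacked and the number of residues released. A minimal sketch follows; the function name `classify_exopeptidase` and the `dipeptide_only` flag are illustrative assumptions, while the names and EC sub-subclasses are those quoted in the text.

```python
from typing import Tuple

# (terminus, residues released) -> (name, EC sub-subclass), from the text.
EXOPEPTIDASES = {
    ("N", 1): ("aminopeptidase", "EC 3.4.11"),
    ("N", 2): ("dipeptidyl-peptidase", "EC 3.4.14"),
    ("N", 3): ("tripeptidyl-peptidase", "EC 3.4.14"),
    ("C", 1): ("carboxypeptidase", "EC 3.4.16, EC 3.4.17 or EC 3.4.18"),
    ("C", 2): ("peptidyl-dipeptidase", "EC 3.4.15"),
}


def classify_exopeptidase(terminus: str, residues_released: int,
                          dipeptide_only: bool = False) -> Tuple[str, str]:
    """Name an exopeptidase activity from where and how much it cleaves.

    dipeptide_only marks an enzyme whose only substrates are dipeptides,
    which the text places in its own category (dipeptidase, EC 3.4.13).
    """
    if dipeptide_only:
        return ("dipeptidase", "EC 3.4.13")
    key = (terminus.upper(), residues_released)
    if key not in EXOPEPTIDASES:
        raise ValueError(f"no exopeptidase category defined for {key}")
    return EXOPEPTIDASES[key]
```

A lookup table is used rather than branching logic because the categories are a fixed vocabulary; combinations not named in the text (for example, release of a tripeptide from the C terminus) raise an error instead of being guessed.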

There are peptidases with activities that do not satisfy the definitions of endo- and exopeptidases. There are isopeptidases that cleave isopeptide bonds such as those by which proteins are tagged. Proteins are tagged as signals for degradation or export from one cellular compartment to another, for example from the nucleus to the cytoplasm. Removing these tags prevents degradation or export and allows for recycling of the tag. The C terminus of ubiquitin is attached to the amino side chain of a lysine to target the protein for degradation by the proteasome. The ubiquitin tag is removed by a deubiquitinating enzyme (also known as a DUB; Mevissen & Komander (2017)). DUBs have previously been classified into the following four groups: ubiquitin-specific proteases (USPs, family C19 in MEROPS), ubiquitin C-terminal hydrolases (UCHs, family C12), the ovarian tumour family (OTUs, family C65) and the Machado-Josephin domain family (MJDs, family C86) (Amerik & Hochstrasser, 2004). Other peptidases are able to release a modified N-terminal amino acid, which suggests an exopeptidase activity, except that the N terminus is blocked. Examples include pyroglutamate-peptidase I, which releases an N-terminal pyroglutamate (De Gandarias et al., 1994), and acylaminoacyl-peptidase, which releases an N-terminal acylated amino acid (Kiss et al., 2007). Such peptidases have been called “omega peptidases” (EC 3.4.19), a term invented by McDonald & Barrett (1986).

The formation of a peptide bond is a dehydration reaction in which a water molecule is released. It therefore follows that a peptide bond is most frequently broken by hydrolysis, in which a water molecule is consumed. In some peptidases, the nucleophile in the reaction is the side chain of an amino acid (known as a “protein nucleophile”), either the hydroxyl group of a serine or threonine, or the thiol group of a cysteine. In other peptidases, a water molecule is activated to become the nucleophile (known as a “water nucleophile”) and is bound either to an aspartic acid, glutamic acid or a metal ion. The nature of the nucleophile determines the catalytic type of the peptidase, which can be serine, threonine, cysteine, aspartic, glutamic or metallo. Glutamic peptidases were unknown in 1993 and the catalytic type was discovered in 2004 (Fujinaga et al., 2004). In the peptidolytic reaction, if water is replaced by another solvent then a moiety may be attached to the new N terminus of the peptide. If this occurs, then the enzyme is described as a “transpeptidase”. An example of a transpeptidase is gamma-glutamyl transferase, which releases gamma-Glu from a substrate such as glutathione and transfers it to the N terminus of another amino acid or dipeptide (Tate & Meister, 1985). Another example is the D-Ala-D-Ala-carboxypeptidase, which releases a C-terminal D-Ala from the precursor of the crosslinking peptide of a bacterial cell wall but can also catalyse a transpeptidation reaction (Perkins et al., 1973). Residues in addition to the nucleophile are required for the catalytic mechanism and form the active site of the peptidase. In many peptidases with a protein nucleophile, a catalytic triad exists. Frequently a histidine residue acts as a general base, and a third residue, often an aspartic acid, is required to orientate and polarize the histidine ring. Many hydrolases other than peptidases also utilize a catalytic triad, including amidases, esterases, acylases, lipases and β-lactamases (Rauwerdink & Kazlauskas, 2015).

The hydrolysis of a peptide bond by a peptidase with a protein nucleophile progresses in stages. The first stage is the formation of a tetrahedral intermediate in which the carbonyl carbon of the substrate is forced to accept an electron by the nucleophile. The build-up of negative charge on the intermediate has to be stabilized by a residue or residues of the peptidase that form the “oxyanion hole” (Robertus et al., 1972). In the second stage of the reaction, the tetrahedral intermediate collapses back to a carbonyl, ejecting the first product of the reaction, often with an electron being donated to this first leaving group by the histidine. A second intermediate forms, the acyl-enzyme intermediate, with the remainder of the substrate bound to the enzyme. A water molecule then acts as a second nucleophile to resolve the second intermediate, releasing the second product and the peptidase. Where a single residue acts as the oxyanion hole, this may be considered as a fourth component of a catalytic tetrad.

Not all peptidases with a protein nucleophile require a catalytic triad. In some a catalytic dyad is sufficient, and in N-terminal hydrolases in which an N-terminal threonine, serine or cysteine acts as the nucleophile, the amino group of the same N-terminal residue also acts as the general base (Brannigan et al., 1995).

The mechanism in peptidases with a water nucleophile is somewhat different. In aspartic and glutamic peptidases, the water is bound to two aspartic or glutamic acid residues, respectively. In metallopeptidases, the water is bound to a metal ion, which is in turn bound to three residues of the peptidase. Often two of the metal ligands occur within a motif such as HEXXH (Jongeneel et al., 1989), with the third ligand C-terminal to this motif. In the HEXXH motif, the histidines are the metal ligands and the glutamic acid is an active site residue. Metallopeptidases with an HEXXH motif have been termed “zincins”, because the majority bind a zinc ion. A zincin in which the third zinc ligand is a glutamic acid is known as a “Glu-zincin” (Hooper, 1994). Some zincins have an extended HEXXHXXGXXH/D motif in which the third histidine or an aspartic acid is the third metal ligand, and these are known as “Met-zincins” (because a methionine is also required for activity) and “Asp-zincins”, respectively (Bode et al., 1993; Fushimi et al., 1999). Another clan (ME) of metallopeptidases has the HEXXH motif reversed (HXXEH) and these are known as “inverzincins” (Hooper, 1994). Zincins correlate to clan MA in MEROPS, with Glu-zincins, Met-zincins and Asp-zincins forming subclans MA(E), MA(M) and MA(D). At least one member of clan MA is known to bind copper rather than zinc and the first histidine in the HEXXH motif is replaced by glutamine (Heitzer & Hallman, 2002). Other residues, besides the Glu in the HEXXH motif, are required for catalysis and the number of such residues differs between families of metallopeptidases. Some metallopeptidases have co-catalytic metal ions, in which each metal is bound to three residues of the peptidase, but one of these residues binds both metals so that there are only five ligands.

A peptide bond is planar and very stable, which is why hydrolysis requires enzymatic activity. However, some peptide bonds are either energetically unstable, or can be induced to be so. Then self-cleavage of the peptide bond can occur. Some proteins cleave themselves when an asparagine residue is induced to cyclize to form a succinimide and these are known as asparagine peptide lyases (Rawlings et al., 2011). This group includes intein-containing proteins, which form subclan PD(N), and are structurally related to the self-cleaving hedgehog proteins (subclan PD(C); Mizutani et al., 2001). An intein is a functional protein, often an enzyme, which is released following two cleavage events, one at the N terminus and one at the C terminus of the intein (it is the C-terminal cleavage which depends on asparagine peptide lyase activity). Splicing then occurs to rejoin the fragments of the “host” protein, which are known as the N- and C-exteins (Noren et al., 2000).

The MEROPS classification is widely recognized and has been included in UniProt since 1993. We collaborate closely with the Pfam, InterPro and Panther databases to ensure that coverage within a family is approximately the same in all databases (Studholme et al., 2003; Rawlings et al., 2018). In this paper, the classification of peptidases in the latest release of the MEROPS database and website is compared with that from the original 1993 Rawlings & Barrett paper. The distribution of activities amongst peptidase families and different catalytic mechanisms amongst clans will also be examined.

Materials and Methods

Materials. Protein sequences are taken from the UniProt (UniProt Consortium, 2019) and NCBI Protein sequence databases (Sayers et al., 2019). Structures are taken from the Protein Data Bank (Burley et al., 2019).

Sequence comparison. Sequence similarity was compared using either BlastP (Altschul et al., 1990), FastA (Pearson, 1990) or the Hmmer website (Potter et al., 2018). These were used remotely to search either the UniProt or NCBI Protein databases. FastA and BlastP were also installed locally and were used to search the MEROPS sequence collection. Only the peptidase unit from a peptidase sequence was submitted for searching, so that no false positives with domains other than the peptidase domain were encountered. A sequence was considered to be a homologue if, in the alignment with a known peptidase, the peptidase units overlapped with an expect (E) value of 0.001 or less. When FastA or BlastP were performed locally against the MEROPS sequence collection, which has far fewer sequences, a lower E value of e-10 was considered to represent homology.
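The two homology cut-offs described above can be expressed as a simple filter. This is a sketch only: the function name `is_homologue` and the constant names are illustrative, and the local cut-off is an assumption, because the text gives it as “e-10”, which is read here as 1e-10.

```python
# Cut-off for remote searches of UniProt or NCBI Protein (from the text).
REMOTE_E_VALUE_CUTOFF = 1e-3

# Cut-off for local searches of the smaller MEROPS sequence collection.
# Assumption: the text's "e-10" is interpreted as 1e-10.
LOCAL_E_VALUE_CUTOFF = 1e-10


def is_homologue(e_value: float, searched_locally: bool = False) -> bool:
    """Apply the E-value threshold appropriate to the database searched."""
    cutoff = LOCAL_E_VALUE_CUTOFF if searched_locally else REMOTE_E_VALUE_CUTOFF
    return e_value <= cutoff
```

The stricter local threshold reflects the point made in the text: an E value depends on database size, so a hit against a much smaller collection needs a lower E value to carry comparable weight.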

Structure comparison. The Dali server was used to compare PDB structures (Holm & Laakso, 2016). Two structures were considered to be significantly similar if the Z score was 6 standard deviation units or more.

Alignments and phylogenetic trees. Muscle was used to align protein sequences (Edgar, 2004). Phylogenetic trees were created using Quicktree (Howe et al., 2002).
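The structural similarity threshold given under “Structure comparison” can be applied when deciding whether a new structure resembles the type structure of any existing clan. The sketch below assumes that Dali Z scores against each clan's type structure have already been obtained and collected into a dictionary; the function name `assign_clan` and the rule of choosing the highest-scoring clan are illustrative assumptions, since the text states only the threshold.

```python
from typing import Dict, Optional

Z_SCORE_CUTOFF = 6.0  # "6 standard deviation units or more" (from the text)


def assign_clan(z_scores: Dict[str, float],
                cutoff: float = Z_SCORE_CUTOFF) -> Optional[str]:
    """Return the clan whose type structure scores best, if significant.

    z_scores maps a clan name to the Dali Z score obtained against that
    clan's type structure. Picking the best-scoring clan is an assumption
    made for this sketch; None means no clan passes the threshold.
    """
    if not z_scores:
        return None
    best_clan = max(z_scores, key=z_scores.get)
    return best_clan if z_scores[best_clan] >= cutoff else None
```

With hypothetical scores such as `{"PA": 14.2, "SB": 2.1}` the function returns `"PA"`; if every score falls below 6 it returns `None`, which would correspond to a structure that cannot be placed in an existing clan.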

Literature searching. PubMed was searched weekly using a bespoke algorithm that included, amongst other terms, peptidase names and names of researchers known to be prominent in the field. From this literature, data were extracted to form the MEROPS substrate cleavage site collection (Rawlings, 2009) and the collection of peptidase-inhibitor interactions (Rawlings et al., 2008). Bespoke programs were written to extract cleavages from the proteomics data provided as supplementary files attached to some publications. From the literature, well-characterized peptidases were identified to represent type examples, and details of active site residues and metal ligands were derived.

Criteria for distinguishing a peptidase from a homologous holotype. The criteria are shown in Table 1. Criteria 1 to 5 can be determined from the comparison of the protein sequences or by generation of a phylogenetic tree for the family or subfamily in question. Criteria 6 to 9 can only be determined experimentally and for these, data were acquired from the literature.

Classifying a new peptidase activity. Fig. 2 shows a flow diagram of the classification process. When an apparently novel peptidase is described in the literature, if the sequence is not known, then the peptidase is classified as “unsequenced” and an identifier assigned. If the sequence is known, is a homologue of a member of an existing family and passes the tests to be considered a unique activity, then a new identifier is created in that family and orthologues are identified from the phylogenetic tree. If the activity is not sufficiently different from that of an existing activity, then the sequence is assigned to the MEROPS identifier of the existing activity. If the sequence is unrelated to an existing family, then a new family is created and homologues are searched for in UniProt and/or NCBI Protein.
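The classification process just described can be summarized as a short decision function. This is a sketch of the prose above rather than a reproduction of Fig. 2: the enum `Outcome`, the function name and the three boolean inputs are illustrative assumptions, and each boolean stands for a judgement (homology search, Table 1 criteria) that is made elsewhere.

```python
from enum import Enum


class Outcome(Enum):
    UNSEQUENCED = "assign an 'unsequenced' identifier"
    NEW_FAMILY = "create a new family and search for homologues"
    NEW_IDENTIFIER = "create a new identifier in the existing family"
    EXISTING_IDENTIFIER = "assign to the identifier of the existing activity"


def classify_new_peptidase(sequence_known: bool,
                           homologous_to_existing_family: bool,
                           distinct_from_existing_activity: bool) -> Outcome:
    """Follow the decisions described in the text, in order."""
    if not sequence_known:
        return Outcome.UNSEQUENCED
    if not homologous_to_existing_family:
        return Outcome.NEW_FAMILY
    if distinct_from_existing_activity:
        return Outcome.NEW_IDENTIFIER
    return Outcome.EXISTING_IDENTIFIER
```

The order of the checks matters: whether the activity is distinct is only asked once the sequence has been placed in an existing family, because the criteria in Table 1 compare a peptidase with a homologous holotype.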

Fig. 2 near here

Results

Table 2 shows counts of sequences, families and clans from the original 1993 paper (Rawlings & Barrett, 1993) and the current release (12.1) of the MEROPS database. The table also compares the number of protein species from the first edition of the Handbook of Proteolytic Enzymes (Barrett et al., 1998) with the current release of MEROPS. All of these items have increased substantially, but there has been an astronomical increase in sequences (1682-fold), thanks mainly to the advances in genome and metagenome sequencing.

Table 2 near here

Table 2 also shows counts of items from the current MEROPS release that were not included in the 1993 paper, namely references, substrate cleavages and peptidase-inhibitor interactions. Of the substrate cleavages, 58,429 are in proteins and peptides but are non-physiological, 27,713 are physiological, 6,111 are from synthetic substrates, 1,403 are pathological, and 1,997 are theoretical. Of the peptidase-inhibitor interactions, 1,830 are complexes with inhibitor proteins and 4,653 are complexes with small molecule inhibitors.
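As a quick check on the scale of these collections, the category counts quoted above can be summed. The sketch assumes that the categories are mutually exclusive, which the text does not state explicitly, so the totals computed here are simply sums of the quoted figures and need not match the totals in Table 2.

```python
# Counts quoted in the text for the current MEROPS release.
substrate_cleavages = {
    "non-physiological (proteins and peptides)": 58_429,
    "physiological": 27_713,
    "synthetic substrates": 6_111,
    "pathological": 1_403,
    "theoretical": 1_997,
}
inhibitor_interactions = {
    "inhibitor proteins": 1_830,
    "small molecule inhibitors": 4_653,
}

# Assumption: categories do not overlap, so a plain sum gives the total.
total_cleavages = sum(substrate_cleavages.values())
total_interactions = sum(inhibitor_interactions.values())

# Share of each cleavage category, as a percentage of the summed total.
cleavage_shares = {
    name: 100 * count / total_cleavages
    for name, count in substrate_cleavages.items()
}

print(f"substrate cleavages: {total_cleavages}")
print(f"peptidase-inhibitor interactions: {total_interactions}")
for name, share in cleavage_shares.items():
    print(f"  {name}: {share:.1f}%")
```

Under that assumption the quoted figures sum to 95,653 substrate cleavages and 6,483 peptidase-inhibitor interactions.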

The distribution amongst different catalytic types is shown in Table 3. The catalytic type with most sequences is serine (421,996), which also has most protein-species (1,710). However, the catalytic type with most families is cysteine (96). Many RNA viruses have one or more cysteine peptidases as a component of the polyprotein and for processing it, and because of the high mutation rate in viruses, peptidase sequences show little similarity between families of viruses, which results in many families of viral cysteine peptidases (Rawlings & Bateman, 2019). There are 6,577 sequences from nine families where the catalytic type is either unknown or ambiguous. Additionally, there are two families containing 6,588 sequences in which the catalytic type can be either serine, threonine or cysteine.

There are also 195 peptidases for which the sequence is unknown or the known sequence fragment is too short to enable classification. These include nine aspartic endopeptidases, 13 cysteine endopeptidases, 79 metallopeptidases (all endopeptidases except eight aminopeptidases, two dipeptidases, two peptidyl-dipeptidases, and four carboxypeptidases), 69 serine peptidases (all endopeptidases except one dipeptidyl-peptidase and one omega peptidase), and 26 peptidases of unknown catalytic mechanism (all endopeptidases except three dipeptidases, one dipeptidyl-peptidase, one carboxypeptidase and one omega peptidase).

Table 3 near here

Remarkably, the number of clans has risen eightfold. Almost all peptidase families now contain at least one homologue of known structure, and there are only forty families for which the structure is not known. The catalytic type with most clans is metallo (16). Some serine, cysteine and threonine peptidases have similar tertiary structures so there are five clans of mixed type. All threonine peptidases with known structures are related to serine and cysteine peptidases. No structures are known for peptidases of unknown catalytic type.

The full MEROPS classification of peptidases is shown in Supplementary Table 1. This does not include non-peptidase homologues, pseudogenes or uncharacterized peptidases that are not orthologues of existing peptidases.

Most peptidases are endopeptidases, and there are examples from every catalytic type. It is difficult to draw conclusions about endopeptidase specificity from the MEROPS classification, but it can be noted that endopeptidases from clans PA (chymotrypsin), PB (Ntn hydrolases) and CD (caspases and homologues) mostly have specificity directed towards the residue that occupies the P1 position in the substrate.

The different types of exopeptidase are not confined to a single catalytic type or family. Table 4 lists the clans, families and examples of exopeptidases. Dipeptidases can be cysteine, metallo, serine or threonine type; aminopeptidases and dipeptidyl-peptidases can be cysteine, metallo or serine type; carboxypeptidases can be metallo or serine type; whereas peptidyl-dipeptidases are all metallo-type and tripeptidyl-peptidases are all serine-type. There are no exopeptidases of aspartic or glutamic type. Identification of catalytic types of exopeptidases does not, however, show their independent origins. From Table 4, it can be seen that dipeptidases fall into seven clans including four clans of metallopeptidases; aminopeptidases into eleven clans including six clans of metallopeptidases; carboxypeptidases into seven clans including four metallopeptidase clans and three serine peptidase clans; dipeptidyl-peptidases into five clans; tripeptidyl-peptidases into two serine clans; and only peptidyl-dipeptidases are derived from a single clan (MA, zincins) and have a single evolutionary origin. A family may contain more than one exopeptidase activity: C1 includes an aminopeptidase and a dipeptidyl-peptidase (and endopeptidases); M15 and M20 contain dipeptidases and carboxypeptidases; C69 and M24 contain aminopeptidases and dipeptidases; M28 and S12 contain aminopeptidases and carboxypeptidases; S28 contains a carboxypeptidase and a dipeptidyl-peptidase; and S33 contains an aminopeptidase, a dipeptidase, dipeptidyl-peptidases and tripeptidyl-peptidases. The clan with most different exopeptidase activities is SC (alpha-beta hydrolases).

Omega-peptidases include deubiquitinating enzymes in families C12, C19, C64, C76, C87 (all from clan CA) and M67; desumoylating and deneddylating enzymes in C48 (clan CE) and C97 (clan CP); pyroglutamyl peptidases in C15 and M1; acylamino-acyl peptidases in S9; gamma-glutamyl hydrolases in C26; gamma-glutamylcysteine dipeptidyltranspeptidase in C83; gamma-D-glutamyl-(L)-meso-diaminopimelate peptidase I in M14; and gamma-glutamyltransferases in T3. These also represent different catalytic types and evolutionary origins.

The known active site residues (and metal ligands for metallopeptidases) are shown for each clan and subclan in Table 5. In 1993, we assumed that all peptidases with the same order of catalytic residues in the protein sequence would have similar tertiary structures, and suggested that knowing that order would establish to which clan a family of peptidases belonged, even if the tertiary structure had not been solved. Although this still might hold true for a peptidase with a catalytic triad or tetrad, it is no longer possible to assign a peptidase of unknown structure to a clan when only a catalytic dyad is known. Another assumption made in 1993 was that all peptidases within a clan would have the same active site residues in the same order in the sequence. This has also proven not to be the case. In 1993, we recognized only 14 different catalytic mechanisms: six different arrangements of active site residues for serine peptidases, two for cysteine peptidases, one for aspartic peptidases, and five for metallopeptidases. The only clans recognized then were SA, SE, CA, CB, AA and MA. All of these clan names are still current, except for clan SA which is now known as PA following the discovery that cysteine endopeptidases from polio and related viruses share a tertiary structure with serine endopeptidases such as chymotrypsin (Allaire et al., 1994). In 1993, clans were not assigned if the structure was confined to one family, but it was already clear that eight families had different catalytic mechanisms to those assigned to clans, namely S8 (now in clan SB, subtilases), S9 and S10 (SC, alpha-beta hydrolases), S14 (SK, ClpP/crotonase), M14 (MC, zinc carboxypeptidases), M15 (MD), M16 (ME, inverzincins) and M17 (MF).

Table 5 shows that considerably more catalytic mechanisms have now been identified. However, it should be noted that not all active site residues are known for some families, especially peptidases from viruses where they are mostly identified by mutagenesis (for example, families in clan CA). The table also shows that peptidases from different clans can have the same order of catalytic residues in the sequence: clans AC (signal peptidase II) and AD (presenilin/type IV prepilin peptidase); CD (caspases), CP (desumoylating isopeptidase), CQ (pestivirus Npro peptidase) and CR (Prp peptidase); CM (hepatitis C virus peptidase 2) and some members of subclan PA(C); CF (pyroglutamyl peptidase I) and some members of subclan PC(C) (picornain); SH (assemblin) and SP (nucleoporin 145); SF (signal peptidase I), SJ (lon protease), SK (ClpP endopeptidase) and SO (endosialidase CIMCD). Clans CA (papain-like) and CE (adenain-like) both have catalytic tetrads (including the residue that helps form the oxyanion hole) but one is the reverse of the other, and it has been suggested either that the clans show a remarkable example of convergent evolution or that they had a common origin but that a circular permutation of the subdomains occurred (Ding et al., 1996). Peptidases in clans MA (zincins) and MM (S2P protease) both contain the HEXXH motif in which the histidines are metal ligands and the glutamic acid is a catalytic residue, but the structures differ and peptidases in clan MM are intramembrane (Feng et al., 2007). Motifs similar to HEXXH occur in peptidases from clans ME (HXXEH, inverzincins) and MU (HEXXXH) but the structures differ (Taylor et al., 2001; Hu et al., 2012). Clans SC (alpha-beta hydrolases) and SS (murein tetrapeptidase) are also similar except that the second residue in the triad is either Asp or Glu.
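The comparison of residue order described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used to build MEROPS: the function names are hypothetical, the four entries are copied from rows of Table 5 (comma between residues, forward slash between alternatives), and the reversed tetrad is an invented string used solely to exercise the reverse-order check.

```python
def parse_residues(entry: str) -> list[frozenset[str]]:
    """Split a Table 5 style entry into ordered positions.

    A comma separates positions; a forward slash separates alternative
    residues that may occupy the same position.
    """
    return [
        frozenset(alt.strip() for alt in position.split("/"))
        for position in entry.split(",")
    ]


def same_order(a: list[frozenset[str]], b: list[frozenset[str]]) -> bool:
    """True if both entries have the same length and every position
    shares at least one allowed residue."""
    return len(a) == len(b) and all(x & y for x, y in zip(a, b))


def is_reverse(a: list[frozenset[str]], b: list[frozenset[str]]) -> bool:
    """True if one entry matches the other read backwards."""
    return same_order(a, b[::-1])


# Entries copied from Table 5 (clan, families in parentheses).
table5_rows = {
    "AC (A8)": "Asp, Asp",
    "AD (A22, A24)": "Asp, Asp",
    "CA (C1, C2, C12, C47)": "Gln, Cys, His, Asn",
    "CA (C19)": "Asn, Cys, His, Asp/Asn",
}
parsed = {name: parse_residues(entry) for name, entry in table5_rows.items()}

# Different clans, same order of catalytic residues.
print(same_order(parsed["AC (A8)"], parsed["AD (A22, A24)"]))

# Same clan, different residue at the first position.
print(same_order(parsed["CA (C1, C2, C12, C47)"], parsed["CA (C19)"]))

# Hypothetical reversed tetrad (NOT taken from Table 5), to show the check.
hypothetical_reversed = parse_residues("Asn, His, Cys, Gln")
print(is_reverse(parsed["CA (C1, C2, C12, C47)"], hypothetical_reversed))
```

A match from `same_order` says nothing about fold: as the text notes, clans AC and AD share the same order of catalytic residues yet are separate clans, which is why residue order alone cannot assign a family to a clan.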

In 1993, we knew only one family of cocatalytic metallopeptidases (M17). It is now apparent that cocatalytic metallopeptidases belong to at least six different clans: MF (leucyl aminopeptidase), MG (methionyl aminopeptidase), MH (aminopeptidase Y), MJ (membrane dipeptidase), MN (DppA aminopeptidase) and MQ (aminopeptidase T), and that bound metals can be zinc, cobalt or manganese ions.

There can also be very different catalytic residues in peptidases from different families within a clan. Perhaps the most striking difference is between families in clan SB (subtilases), in which the subtilisins from family S8 (mostly active at alkaline pH) have an Asp, His, Asn, Ser tetrad, but the sedolisins from family S53 (mostly active at acidic pH) have a Glu, Asp, Asp, Ser tetrad (Wlodawer et al., 2003).

Table 6 shows the distribution of peptidases amongst the thirteen model organisms for which every peptidase in the proteome has been assigned to a MEROPS identifier. In general, the more complex the organism the more peptidases it has: the organisms with most are human, mouse and Arabidopsis thaliana. The archaean Pyrococcus furiosus has fewest, but it also has fewest characterized proteins in general, because of its preference for high temperatures and low pH, and many of its uncharacterized proteins could be peptidases of novel families. Plasmodium falciparum does not seem to have a reduced peptidase complement, as might be expected for a parasite. Table 6 also shows the distribution of peptidases amongst catalytic types. In human, mouse, D. melanogaster and A. thaliana, there are more serine peptidases than any other type. Note that this is not true of Danio rerio or Caenorhabditis elegans, both of which have similar numbers of cysteine, serine and metallopeptidases. A. thaliana has more aspartic peptidases than any of the other model organisms. All the peptidases of mixed catalytic type in the table are from animals and from the family P2; asparagine peptide lyases are found only in microbes; and the uncharacterized peptidases are from bacteria.

Table 6 also shows the number of holotypes from each model organism. From the percentage of peptidases that are holotypes it is clear that most D. rerio peptidases can be mapped to the same identifiers as peptidases from human or mouse, which is not unexpected. What is more surprising is that 88% of peptidases from D. melanogaster are holotypes, which means that very few peptidases could be mapped to the identifiers used for peptidases from chordates. It is also surprising that 89% of peptidases from C. elegans are holotypes, which means that very few can be mapped to the D. melanogaster holotypes. More than half (58%) of peptidases from the Gram-positive bacterium B. subtilis are holotypes, suggesting that few peptidases are shared with the Gram-negative E. coli. The overall conclusion is that there are very few peptidases that can be identified as orthologues across organism phyla.

Table 6 also shows the number of holotypes from each model organism that are uncharacterized biochemically and are known only as sequences. It is surprising that the majority of peptidases are uncharacterized from the well-studied organisms A. thaliana (79%), C. elegans (79%), D. melanogaster (74%) and D. discoideum (72%).

Discussion

Throughout this paper, I have used the term “proteolytic enzyme” to indicate an enzyme that catalyses the breaking of a peptide or isopeptide bond, to encompass not only hydrolases but also asparagine peptide lyases. A proteolytic enzyme that is a hydrolase is also known as a protease, proteinase or a peptidase. The oldest of these terms is “protease” (Malfitano, 1900) but it predates “proteolytic enzyme” by only three years (Weis, 1903; Vines, 1903). “Peptidase” dates from 1918 (Petersen & Short, 1918), and “proteinase” from 1928 (Grassmann & Dyckerhoff, 1928). The terms “protease” and “proteinase” are derived from “protein” and the suffix “-ase”, which might imply an ability to cleave proteins. Because not all proteolytic enzymes are capable of this, the term “peptidase” is preferred (Barrett & McDonald, 1986).

One of the difficulties in the classification of peptidases by activity, besides the fact that it fails to reflect evolutionary relationships, is that there are peptidases with multiple activities. For example, cathepsin B can act either as an endopeptidase or as an exopeptidase. As an endopeptidase, its specificity is extremely difficult to define (Biniossek et al., 2011). As an exopeptidase, it releases a dipeptide from the C terminus and is classified as a peptidyl-dipeptidase. A structure known as an “occluding” loop blocks access to some substrate-binding sites, restricting the specificity (Musil et al., 1991). Cathepsin H also has dual specificity, but is often thought to be only an aminopeptidase, releasing a single amino acid from the N terminus of a protein or peptide. However, its ability to be inhibited by alpha2-macroglobulin, which requires cleavage of an internal peptide bond within what is known as the “bait region”, shows that it is also an endopeptidase (Mason, 1989). There is also controversy over whether some peptidases are just exopeptidases or whether they are capable of acting as endopeptidases. The exopeptidase known as dipeptidyl-peptidase I or cathepsin C is such an example, being able to cleave synthetic substrates in which the N terminus is modified, an activity thought to be exclusive to endopeptidases (Kuribayashi et al., 1993). Some endopeptidases cleave bonds involving modified amino acids, but are not classified as omega-peptidases. For example, signal peptidase II, which releases the N-terminal signal peptide from prolipoprotein, a component of the bacterial cell wall, cleaves a bond N-terminal to a Cys residue that has been previously modified to diacylglycerol-Cys by diacylglyceryl transferase (Hussain et al., 1982).

Some omega-peptidases can act as endopeptidases. Ubiquitin is synthesized as a precursor that contains multiple copies of ubiquitin, and a DUB may release the individual ubiquitin molecules in a reaction in which it acts as an endopeptidase by cleaving the Gly-Met bond between ubiquitin repeats in the precursor or in synthetic fusion proteins (Baker et al., 1999).

The criteria for distinguishing different peptidases as outlined in Table 1 require some discussion. There are circumstances which could lead to misidentification of a new peptidase. For example, the apparent sequence difference would be greater if a sequencing error resulted in a frameshift or incorporated an intron as coding sequence. Such errors might also give the appearance of different protein architectures, as would misidentification of the initiating methionine or stop codon, or comparison of different splice variants from orthologous genes. Sequencing errors might increase the percentage difference between the sequences, which would also affect their positions on a phylogenetic tree. If the sequence is derived from an incomplete genome sequencing project, then it may not be known that apparent paralogues are actually splice variants from the same gene, and these may again be misidentified as novel peptidases. Because there is some kudos to be gained from studying a novel peptidase, as opposed to a peptidase that is merely an orthologue from a different organism, there may be an unconscious bias amongst researchers to emphasize minor differences in substrate specificity and inhibitor interactions. Very often, non-physiological substrates and inhibitors are used to distinguish peptidases, and it may not be known if these differences have any physiological significance.

Conclusions

The classification of proteolytic enzymes as originally developed by Rawlings & Barrett (1993) still holds, but some tweaks have been necessary to incorporate the discovery of new catalytic types, to include asparagine peptide lyases, and to cope with clans and families with mixed catalytic types. There has been a substantial increase in the number of peptidase sequences, mainly because of the developments in genome and metagenome sequencing, but also because more families of peptidases have been identified. The numbers of clans and families have also dramatically increased since 1993. In 1998, Barrett et al. introduced the concept of the protein-species, in which a peptidase with the same function is identified between species, and each unique peptidase was assigned an identifier. Once again, the number of identifiers has increased substantially. Partly, this has been because of the introduction of identifiers for uncharacterized proteins from model organisms, although many of these have subsequently been characterized. In 1993, Rawlings and Barrett noted that classifying peptidases by catalytic type or activity did not reflect evolutionary relationships, and not only does this still hold true, but the evidence for this has increased. Some assumptions made in 1993 have been shown not to be correct. The idea that the order of catalytic residues in the sequence reflected structural relationships has been shown to be false because elucidation of some structures has revealed that peptidases with different folds can have the same order of catalytic residues. Although it still holds in most cases that the active site residues are conserved amongst the peptidases in a family, there are notable exceptions, and there are families in which the nucleophile can be either serine, threonine or cysteine.

Acknowledgements

I would like to thank Dr Alan Barrett for searching the literature and maintaining the reference collection in MEROPS.
I would also like to thank Dr Alex Bateman for his continued support for the MEROPS database and staff and former colleagues at the EBI for helping maintain the website and for help with technical issues.

Funding: This project did not receive funding.

Declaration of interest: None

References

Rawlings ND, Barrett AJ. (1993) Evolutionary families of peptidases. Biochem J. 290:205-18. PMID: 8439290.

Rawlings ND, Barrett AJ, Thomas PD, Huang X, Bateman A, Finn RD. (2018) The MEROPS database of proteolytic enzymes, their substrates and inhibitors in 2017 and a comparison with peptidases in the PANTHER database. Nucleic Acids Res. 46(D1):D624-D632. PMID: 29145643.

Burley SK, Berman HM, Bhikadiya C, Bi C, Chen L, Di Costanzo L, Christie C, Dalenberg K, Duarte JM, Dutta S, Feng Z, Ghosh S, Goodsell DS, Green RK, Guranovic V, Guzenko D, Hudson BP, Kalro T, Liang Y, Lowe R, Namkoong H, Peisach E, Periskova I, Prlic A, Randle C, Rose A, Rose P, Sala R, Sekharan M, Shao C, Tan L, Tao YP, Valasatava Y, Voigt M, Westbrook J, Woo J, Yang H, Young J, Zhuravleva M, Zardecki C. (2019) RCSB Protein Data Bank: biological macromolecular structures enabling research and education in fundamental biology, biomedicine, biotechnology and energy. Nucleic Acids Res. 47(D1):D464-D474. PMID: 30357411.

Barrett AJ, Rawlings ND, Woessner JF (eds) (1998) The Handbook of Proteolytic Enzymes. Academic Press, San Diego.

Barrett AJ, Rawlings ND. (2007) 'Species' of peptidases. Biol Chem. 388:1151-7. PMID: 17976007

Schechter I, Berger A. (1967) On the size of the active site in proteases. I. Papain. Biochem Biophys Res Commun. 27:157-62.

Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (1992) Enzyme Nomenclature. Academic Press, San Diego, California.

Ishidoh K, Muno D, Sato N, Kominami E. (1991) Molecular cloning of cDNA for rat cathepsin C. Cathepsin C, a cysteine proteinase with an extremely long propeptide. J Biol Chem. 266:16312-7. PMID: 1885565.

Tomkinson B, Wernstedt C, Hellman U, Zetterqvist O. (1987) Active site of tripeptidyl peptidase II from human erythrocytes is of the subtilisin type. Proc Natl Acad Sci U S A. 84:7508-12. PMID: 3313395.

Allaire M, Chernaia MM, Malcolm BA, James MN. (1994) Picornaviral 3C cysteine proteinases have a fold similar to chymotrypsin-like serine proteinases. Nature 369:72-6. PMID: 8164744.

Oinonen C, Rouvinen J. (2000) Structural comparison of Ntn-hydrolases. Protein Sci. 9:2329-37. PMID: 11206054.

Smith EL, Markland FS, Kasper CB, DeLange RJ, Landon M, Evans WH. (1966) The complete amino acid sequence of two types of subtilisin, BPN' and Carlsberg. J Biol Chem. 241:5974-6. PMID: 4959323.

Alden RA, Wright CS, Kraut J. (1970) A hydrogen-bond network at the active site of subtilisin BPN'. Philos Trans R Soc Lond B Biol Sci. 257:119-24. PMID: 4399039.

Simmonds S, Szeto KS, Fletterick CG. (1976) Soluble tri- and dipeptidases in Escherichia coli K-12+. Biochemistry 15:261-71. PMID: 764862.

Mevissen TET, Komander D. (2017) Mechanisms of Deubiquitinase Specificity and Regulation. Annu Rev Biochem. 86:159-192. PMID: 28498721.

Amerik AY, Hochstrasser M. (2004) Mechanism and function of deubiquitinating enzymes. Biochim. Biophys. Acta 1695: 189–207. PMID: 15571815.

De Gandarias JM, Irazusta J, Fernandez D, Varona A, Casis L. (1994) Developmental changes of pyroglutamate-peptidase I activity in several regions of the female and the male rat brain. Int J Neurosci. 77:53-60. PMID: 7989161.

Kiss AL, Hornung B, Rádi K, Gengeliczki Z, Sztáray B, Juhász T, Szeltner Z, Harmat V, Polgár L. (2007) The acylaminoacyl peptidase from Aeropyrum pernix K1 thought to be an exopeptidase displays endopeptidase activity. J Mol Biol. 368:509-20. PMID: 17350041.

McDonald JK, Barrett AJ (1986) Mammalian Proteases. A Glossary and Bibliography (Academic Press).

Fujinaga M, Cherney MM, Oyama H, Oda K, James MN. (2004) The molecular structure and catalytic mechanism of a novel carboxyl peptidase from Scytalidium lignicolum. Proc Natl Acad Sci U S A. 101:3364-9. PMID: 14993599.

Tate SS, Meister A. (1985) gamma-Glutamyl transpeptidase from kidney. Methods Enzymol. 113:400-19. PMID: 2868390.

Perkins HR, Nieto M, Frére JM, Leyh-Bouille M, Ghuysen JM. (1973) Streptomyces DD carboxypeptidases as transpeptidases. The specificity for amino compounds acting as carboxyl acceptors. Biochem J. 131:707-18.

Rauwerdink A & Kazlauskas RJ. (2015) How the same core catalytic machinery catalyzes 17 different reactions: the serine-histidine-aspartate catalytic triad of α/β-hydrolase fold enzymes. ACS Catal. 5:6153-6176. PMID: 28580193.

Robertus JD, Kraut J, Alden RA, Birktoft JJ. (1972) Subtilisin; a stereochemical mechanism involving transition-state stabilization. Biochemistry. 11:4293-303. PMID: 5079900.

Brannigan JA, Dodson G, Duggleby HJ, Moody PC, Smith JL, Tomchick DR, Murzin AG. (1995) A protein catalytic framework with an N-terminal nucleophile is capable of self-activation. Nature 378:416-9. PMID: 7477383.

Jongeneel CV, Bouvier J, Bairoch A. (1989) A unique signature identifies a family of zinc-dependent metallopeptidases. FEBS Lett. 242:211-4. PMID: 2914602.

Hooper NM. (1994) Families of zinc metalloproteases. FEBS Lett. 354:1-6. PMID: 7957888.

Bode W, Gomis-Rüth FX, Stöckler W. (1993) Astacins, serralysins, snake venom and matrix metalloproteinases exhibit identical zinc-binding environments (HEXXHXXGXXH and Met-turn) and topologies and should be grouped into a common family, the 'metzincins'. FEBS Lett. 331:134-40. PMID: 8405391.

Fushimi N, Ee CE, Nakajima T, Ichishima E. (1999) Aspzincin, a family of metalloendopeptidases with a new zinc-binding motif. Identification of new zinc-binding sites (His(128), His(132), and Asp(164)) and three catalytically crucial residues (Glu(129), Asp(143), and Tyr(106)) of deuterolysin from Aspergillus oryzae by site-directed mutagenesis. J Biol Chem. 274:24195-201. PMID: 10446194.

Heitzer M, Hallmann A. (2002) An extracellular matrix-localized metalloproteinase with an exceptional QEXXH metal binding site prefers copper for catalytic activity. J Biol Chem. 277:28280-6. PMID: 12034745.

Rawlings ND, Barrett AJ, Bateman A. (2011) Asparagine peptide lyases: a seventh catalytic type of proteolytic enzymes. J Biol Chem. 286:38321-8. doi: 10.1074/jbc.M111.260026. PMID: 21832066.

Mizutani R, Nogami S, Kawasaki M, Ohya Y, Anraku Y, Satow Y. (2002) Protein-splicing reaction via a thiazolidine intermediate: crystal structure of the VMA1-derived endonuclease bearing the N and C-terminal propeptides. J Mol Biol. 316:919-29.

Noren CJ, Wang J, Perler FB (2000). Dissecting the chemistry of protein splicing and its applications. Angew Chem Int Ed Engl. 39: 450–66. PMID 10671234

Studholme DJ, Rawlings ND, Barrett AJ, Bateman A. (2003) A comparison of Pfam and MEROPS: two databases, one comprehensive, and one specialised. BMC Bioinformatics 4:17. PMID: 12740029.

UniProt Consortium (2019) UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 47(D1):D506-D515. PMID: 30395287.

Sayers EW, Beck J, Brister JR, Bolton EE, Canese K, Comeau DC, Funk K, Ketter A, Kim S, Kimchi A, Kitts PA, Kuznetsov A, Lathrop S, Lu Z, McGarvey K, Madden TL, Murphy TD, O'Leary N, Phan L, Schneider VA, Thibaud-Nissen F, Trawick BW, Pruitt KD, Ostell J. (2019) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res.(in press) PMID: 31602479.

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. (1990) Basic local alignment search tool. J Mol Biol. 215:403-10. PMID: 2231712.

Pearson WR. (1990) Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol. 183:63-98. PMID: 2156132.

Potter SC, Luciani A, Eddy SR, Park Y, Lopez R, Finn RD. (2018) HMMER web server: 2018 update. Nucleic Acids Res. 46(W1):W200-W204. PMID: 29905871.

Holm L, Laakso LM. (2016) Dali server update. Nucleic Acids Res. 44(W1):W351-5. PMID: 27131377.

Edgar RC. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792-7. PMID: 15034147.

Howe K, Bateman A, Durbin R. (2002) QuickTree: building huge Neighbour-Joining trees of protein sequences. Bioinformatics 18:1546-7. PMID: 12424131.

Rawlings ND. (2009) A large and accurate collection of peptidase cleavages in the MEROPS database. Database (Oxford) 2009:bap015. PMID: 20157488.

Rawlings ND, Morton FR, Kok CY, Kong J, Barrett AJ. (2008) MEROPS: the peptidase database. Nucleic Acids Res. 36(Database issue):D320-5. PMID: 17991683.

Rawlings ND, Bateman A. (2019) Origins of peptidases. Biochimie 166:4-18. PMID: 31377195.

Ding J, McGrath WJ, Sweet RM, Mangel WF. (1996) Crystal structure of the human adenovirus proteinase with its 11 amino acid cofactor. EMBO J. 15:1778-83. PMID: 8617222.

Feng L, Yan H, Wu Z, Yan N, Wang Z, Jeffrey PD, Shi Y. (2007) Structure of a site-2 protease family intramembrane metalloprotease. Science 318:1608-12. PMID: 18063795.

Taylor AB, Smith BS, Kitada S, Kojima K, Miyaura H, Otwinowski Z, Ito A, Deisenhofer J. (2001) Crystal structures of mitochondrial processing peptidase reveal the mode for specific cleavage of import signal sequences. Structure 9:615-25. PMID: 11470436.

Hu Y, Peng N, Han W, Mei Y, Chen Z, Feng X, Liang YX, She Q. (2012) An archaeal protein evolutionarily conserved in prokaryotes is a zinc-dependent metalloprotease. Biosci Rep. 32:609-18. PMID: 22950735.

Wlodawer A, Li M, Gustchina A, Oyama H, Dunn BM, Oda K. (2003) Structural and enzymatic properties of the sedolisin family of serine-carboxyl peptidases. Acta Biochim Pol. 50:81-102. PMID: 12673349.

Malfitano (1900) Sur la protease de l'Aspergillus niger. Ann. Inst Pasteur. 14: 430.

Weis (1903) Études sur les enzymes protéolytiques de l'orge en germination; Compte-rendu des travaux du Laboratoire de Carlsberg. 5: 133.

Vines (1903) Proteolytic enzymes in plants (I). Ann. Bot. 17: 354.

Petersen & Short (1918) On the relation of the serum ereptase (peptidase) titer to the clinical course in pneumonia. J Infect Dis. 22: 14-153.

Grassmann,W. & Dyckerhoff,H. (1928) Uber die Proteinase und die Polypeptidase der Hefe. 13. Abhandlung über Pflanzenproteasen in der von R. Willstätter und Mitarbeitern begonnen Untersuchungsreihe. Hoppe Seylers Z Physiol Chem. 179: 41-78.

Barrett AJ & McDonald JK. (1986) Nomenclature: protease, proteinase and peptidase. Biochem J. 237:935. PMID: 3541905.

Biniossek ML, Nägler DK, Becker-Pauly C, Schilling O. (2011) Proteomic identification of protease cleavage sites characterizes prime and non-prime specificity of cysteine cathepsins B, L, and S. J Proteome Res. 10:5363-73. doi: 10.1021/pr200621z. PMID: 21967108.

Musil D, Zucic D, Turk D, Engh RA, Mayr I, Huber R, Popovic T, Turk V, Towatari T, Katunuma N, et al. (1991) The refined 2.15 A X-ray crystal structure of human liver cathepsin B: the structural basis for its specificity. EMBO J. 10:2321-30. PMID: 1868826.

Mason RW. (1989) Interaction of lysosomal cysteine proteinases with alpha 2-macroglobulin: conclusive evidence for the endopeptidase activities of cathepsins B and H. Arch Biochem Biophys. 273:367-74. PMID: 2476070.

Kuribayashi M, Yamada H, Ohmori T, Yanai M, Imoto T. (1993) Endopeptidase activity of cathepsin C, dipeptidyl aminopeptidase I, from bovine spleen. J Biochem. 113:441-9. PMID: 8514733.

Baker RT, Wang XW, Woollatt E, White JA, Sutherland GR. (1999) Identification, functional characterization, and chromosomal localization of USP15, a novel human ubiquitin-specific protease related to the UNP oncoprotein, and a systematic nomenclature for human ubiquitin-specific proteases. Genomics. 59:264-74. PMID: 10444327.

Hussain M, Ichihara S, Mizushima S. (1982) Mechanism of signal peptide cleavage in the biosynthesis of the major lipoprotein of the Escherichia coli outer membrane. J Biol Chem. 257:5177-82. PMID: 7040395.

Tables

Table 1. Criteria for distinguishing a peptidase homologue from a peptidase holotype. The criteria are derived from Barrett & Rawlings (2007).

Table 2. Comparison of the classifications from 1993 and 2019.

Table 3. Distribution of peptidase homologues amongst catalytic types. For each catalytic type, the number and percentage of sequences, protein-species, families and clans are shown.

Table 4. Distribution of exopeptidases amongst peptidase families. “Activity” indicates the kind of exopeptidase activity. For each different activity, the peptidase clan and family are shown, with examples from each family.

Table 5. Known active site residues and metal ligands for clans and subclans of peptidases. For each clan or subclan the active site residues (and metal ligands for metallopeptidases) are shown in single letter amino acid code. A comma separates the residues in each catalytic dyad, triad or tetrad or metal ligand. A forward slash separates alternative amino acids that are known to occur in active peptidases for any catalytic residue or metal ligand.

Table 6. Peptidases from model organisms. For each model organism, the total number of peptidases (MEROPS identifiers); the number of peptidases of each catalytic type (“Asp”, aspartic; “Glu”, glutamic; “Cys”, cysteine; “Met”, metallo; “Ser”, serine; “Thr”, threonine; “Mix”, mixed; “ALP”, asparagine peptide lyase; “Unk”, unknown); the number of peptidase holotypes (“Hol”); and the number of uncharacterized peptidases (“Unc”) are shown.

Figure legends

Fig. 1. Schechter-Berger nomenclature and endo- and exopeptidases. In the following, the substrate is shown linearly as a “string of pearls” with the N terminus to the left and the C terminus to the right, with each “pearl” representing an amino acid and the peptide bonds indicated by the horizontal, black lines. The residues in P4 to P4’ are labelled. The scissile bond is indicated by the vertical, red arrow. A) The Schechter-Berger nomenclature for substrate binding sites is shown for the activity of an endopeptidase. The endopeptidase is shown in blue. Substrate binding pockets S4 to S4’ are labelled and indicated by vertical, dotted lines. B) Cleavage of a dipeptide by a dipeptidase is shown. C) Cleavage of a peptide by an aminopeptidase is shown. D) Cleavage of a peptide by a dipeptidyl-peptidase is shown. E) Cleavage of a peptide by a tripeptidyl-peptidase is shown. F) Cleavage of a peptide by a carboxypeptidase is shown. G) Cleavage of a peptide by a peptidyl-dipeptidase is shown.

Fig. 2. Flow diagram to show how an apparently novel peptidase activity is classified to a MEROPS identifier.

Supplementary table 1

The full MEROPS classification for all holotypes is shown. Names of peptidases are taken from the literature or from the gene or gene locus name, except when none could be found and a name is invented (starting with Mername). Where the name is derived from a gene name or gene locus name, and that name is not unique (i.e. other organisms have the same gene name but for a product that is not an orthologue) then the scientific name for the species that is the source of the holotype is given. The suffix “-type” is added when orthologues from other species have been identified. Abbreviations: g.p., gene product; obs., obsolete.

Table 1. Criteria for distinguishing a peptidase homologue from a peptidase holotype.

1. The sequences are less than 50% identical.
2. The sequences are derived from paralogous genes in the genome of a model organism.
3. The domain architectures are different or the domains are in a different order.
4. Targeting peptides differ or are absent in one.
5. Sequences are not predicted to be orthologues from their positions on a phylogenetic tree.
6. Actions on substrates differ significantly.
7. Interactions with peptidase inhibitors differ significantly.
8. The peptidases are active under different environmental conditions (e.g. pH, temperature).
9. The peptidases are active in different cellular compartments.
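Criterion 1 is the only numerical test in Table 1, and it can be illustrated with a minimal Python sketch. The paper does not say how percentage identity is calculated, so the choices here are assumptions: the two sequences are taken to be already aligned with "-" as the gap character, and identity is counted over columns where neither sequence has a gap. The function names and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of identical residues over the ungapped columns of a
    pairwise alignment (assumption: '-' marks a gap)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == "-" or res_b == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if res_a == res_b:
            identical += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * identical / compared


def meets_criterion_1(aligned_a: str, aligned_b: str, threshold: float = 50.0) -> bool:
    """Criterion 1 of Table 1: the sequences are less than 50% identical."""
    return percent_identity(aligned_a, aligned_b) < threshold


# Toy alignment, not real peptidase sequences.
print(percent_identity("AC-EF", "ACDEW"))
print(meets_criterion_1("AAAA", "ABBB"))
```

Other denominators (alignment length, or the length of the shorter sequence) give different percentages for the same alignment, which is one reason a sequencing error or a misidentified start codon can move a pair across the 50% line.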

Table 2. Comparison of the classifications from 1993 and 2019.

Item                   1993    2019         Increase (fold)
Sequences              676     1,137,158    1682
Protein-species        749     4341         5.8
Families               84      272          3.2
Clans                  7       56           8
References             -       69,126       -
Substrate cleavages    -       95,653       -
Inhibitor complexes    -       1830         -
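The fold increases in Table 2 are the 2019 count divided by the 1993 count. A short Python check, using only the counts given in the table (the variable and function names are illustrative), reproduces the tabulated values:

```python
# (1993 count, 2019 count) for the rows of Table 2 that have both values.
counts = {
    "Sequences": (676, 1_137_158),
    "Protein-species": (749, 4341),
    "Families": (84, 272),
    "Clans": (7, 56),
}


def fold_increase(old: int, new: int) -> float:
    """Ratio of the newer count to the older count."""
    return new / old


for item, (count_1993, count_2019) in counts.items():
    print(f"{item}: {fold_increase(count_1993, count_2019):.1f}-fold")
```

Rounded, these ratios are 1682, 5.8, 3.2 and 8, matching the "Increase (fold)" column.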

Table 3. Distribution of peptidase homologues amongst catalytic types.

Catalytic type               Sequences   %     Protein-species   %     Families   %     Clans   %
Serine                       421996      37    1710              39    54         20    12      21
Cysteine                     195537      17    1016              23    96         35    11      20
Threonine                    39407       3     105               2     6          2     -       -
Metallo                      406444      38    1089              25    76         28    16      29
Aspartic                     55726       5     319               7     16         6     5       9
Glutamic                     1067        0     8                 0     3          1     2       4
Mixed                        6588        1     53                1     2          1     5       9
Unknown                      6577        1     17                0     9          3     -       -
Asparagine peptide lyases    3816        0     24                0     10         4     5       9
Total                        1137158     100   4341              100   272        100   56      100

Table 4. Distribution of exopeptidases amongst peptidase families.

| Activity | Clan | Family | Examples |
|---|---|---|---|
| Dipeptidase | MD | M15 | vanX D-Ala-D-Ala dipeptidase |
| | MH | M20 | carnosine dipeptidases I and II; Xaa-methyl-His dipeptidase; Xaa-His dipeptidase; cytosolic beta-alanyl-lysine dipeptidase |
| | MG | M24 | Xaa-Pro dipeptidase; PH0974 dipeptidase |
| | MJ | M19 | membrane dipeptidases 1, 2 and 3 |
| | MJ | M38 | isoaspartyl dipeptidase; Pro-Hyp dipeptidase |
| | PB | C69 | dipeptidase A |
| | PB | T2 | isoaspartyl dipeptidase |
| | PC | S51 | dipeptidase E; alpha-aspartyl dipeptidase |
| | SC | S33 | prolyl dipeptidase |
| Aminopeptidase | CA | C1 | aminopeptidase C |
| | MA | M1 | aminopeptidases A, B, G, N, O and Q; alanyl aminopeptidase; cystinyl aminopeptidase; ERAP2 aminopeptidase |
| | MA | M54 | aminopeptidases AMZ1 and AMZ2 |
| | MA | M61 | glycyl aminopeptidase; TET aminopeptidases |
| | MF | M17 | leucyl aminopeptidase |
| | MG | M24 | methionyl aminopeptidases 1 and 2; aminopeptidase P |
| | MH | M18 | aminopeptidase I |
| | MH | M28 | aminopeptidases S and Y |
| | MH | M42 | glutamyl aminopeptidase |
| | MN | M55 | D-aminopeptidase DppA |
| | MQ | M29 | aminopeptidase T; PepS aminopeptidase |
| | unassigned | M77 | tryptophanyl aminopeptidases |
| | PB | C69 | arginine aminopeptidase |
| | PE | P1 | DmpA aminopeptidase |
| | SC | S9 | tyrosyl aminopeptidase |
| | SC | S33 | prolyl aminopeptidase |
| | SE | S12 | aminopeptidase DmpB |
| Dipeptidyl-peptidase | CA | C1 | dipeptidyl-peptidase I |
| | CO | C40 | dipeptidyl-peptidase VI |
| | MA | M49 | dipeptidyl-peptidase III |
| | PA | S46 | dipeptidyl-peptidases 7 and 11 |
| | SC | S9 | dipeptidyl-peptidases IV, 5, 8 and 9 |
| | SC | S15 | Xaa-Pro dipeptidyl-peptidase |
| | SC | S28 | dipeptidyl-peptidase II |
| Tripeptidyl-peptidase | SB | S8 | tripeptidyl-peptidase II |
| | SB | S53 | tripeptidyl-peptidase I |
| | SC | S9 | prolyl tripeptidyl peptidase |
| | SC | S33 | tripeptidyl-peptidases A, B and C |
| Carboxypeptidase | MA | M32 | carboxypeptidase Taq |
| | MC | M14 | carboxypeptidases A1, A2, A3, A4, A5, A6, B, E, M, N, O, T and Z; cytosolic carboxypeptidase 1, 2, 3, 4, 5 and 6 |
| | MD | M15 | zinc D-Ala-D-Ala carboxypeptidase |
| | MH | M20 | glutamate carboxypeptidase I; Gly-Xaa carboxypeptidase |
| | MH | M28 | glutamate carboxypeptidases II and III; carboxypeptidase Q |
| | SC | S9 | S9dr carboxypeptidase |
| | SC | S10 | carboxypeptidases A, C, D, O, P and Y; kex carboxypeptidase |
| | SC | S28 | lysosomal Pro-Xaa carboxypeptidase |
| | SE | S11 | D-Ala-D-Ala carboxypeptidases A and DacF |
| | SE | S12 | D-Ala-D-Ala carboxypeptidase B |
| | SE | S13 | D-Ala-D-Ala carboxypeptidase PBP3 |
| | SS | S66 | murein tetrapeptidase LD carboxypeptidase |
| Peptidyl-dipeptidase | MA | M2 | angiotensin-converting enzyme; peptidyl-dipeptidases Acer and Ance |
| | MA | M3 | peptidyl-dipeptidase Dcp |
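
Several families in Table 4 are listed under more than one exopeptidase activity, which bears on the abstract's point that classification by activity does not reflect evolutionary relationships. A minimal Python sketch, assuming the clan and family assignments transcribed from Table 4 below, groups the rows by family and lists those that occur under two or more activities. The compact `clan:family` encoding and all names are illustrative only.

```python
from collections import defaultdict

# Clan:family pairs per activity, transcribed from Table 4.
TABLE4 = {
    "Dipeptidase": "MD:M15 MH:M20 MG:M24 MJ:M19 MJ:M38 PB:C69 PB:T2 PC:S51 SC:S33",
    "Aminopeptidase": (
        "CA:C1 MA:M1 MA:M54 MA:M61 MF:M17 MG:M24 MH:M18 MH:M28 MH:M42 "
        "MN:M55 MQ:M29 unassigned:M77 PB:C69 PE:P1 SC:S9 SC:S33 SE:S12"
    ),
    "Dipeptidyl-peptidase": "CA:C1 CO:C40 MA:M49 PA:S46 SC:S9 SC:S15 SC:S28",
    "Tripeptidyl-peptidase": "SB:S8 SB:S53 SC:S9 SC:S33",
    "Carboxypeptidase": (
        "MA:M32 MC:M14 MD:M15 MH:M20 MH:M28 SC:S9 SC:S10 SC:S28 "
        "SE:S11 SE:S12 SE:S13 SS:S66"
    ),
    "Peptidyl-dipeptidase": "MA:M2 MA:M3",
}


def activities_by_family(table):
    """Map each family to the set of activities it is listed under."""
    grouped = defaultdict(set)
    for activity, pairs in table.items():
        for pair in pairs.split():
            _clan, family = pair.split(":")
            grouped[family].add(activity)
    return grouped


by_family = activities_by_family(TABLE4)

# Keep only the families that appear under two or more activities.
shared = {fam: sorted(acts) for fam, acts in by_family.items() if len(acts) > 1}

for fam in sorted(shared):
    print(fam, "->", ", ".join(shared[fam]))
```

With the rows as transcribed, family S9 is reported under four activities (aminopeptidase, dipeptidyl-peptidase, tripeptidyl-peptidase and carboxypeptidase).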

Table 5. Known active site residues and metal ligands for clans and subclans of peptidases.

| Clan or subclan | Families or subfamilies | Active site residues (metal ligands) | Exceptions |
|---|---|---|---|
| AA | A1 | Asp, Tyr, Asp | A01.043: His, Ser, Asp; A01.097: Asp, Phe, Asp |
| | A2, A3, A9, A11, A28, A32 | Asp (dimer) | |
| AC | A8 | Asp, Asp | |
| AD | A22, A24 | Asp, Asp | |
| AE | A25 | Asp, Asp, Lys | |
| | A31 | Asp, Asp, His | A31.001: Glu, Asp, His |
| AF | A26 | Asp, Asp, Asp, His | |
| CA | C1, C2, C12, C47 | Gln, Cys, His, Asn | C01.005: Gln, Cys, His, Lys |
| | C6, C10, C21, C31, C32, C51, C64, C76, C86, C104, C105 | Cys, His | |
| | C19 | Asn, Cys, His, Asp/Asn | |
| | C16, C58, C70, C71, C78B, C87, C93, C96, C98, C110, C111, C113, C117 | Cys, His, Asp | |
| | C28 | Asn, Cys, His, Asp | |
| | C39, C67, C83 | Gln, Cys, His, Asp | C39.008: Gln, Cys, His, Glu |
| | C54, C78A | Tyr, Cys, Asp, His | |
| | C65, C85 | Asp, Cys, His | |
| | C66 | Cys, His, Asp, Asp | C66.002: Cys, His, Glu, Asn |
| | C100 | Cys, His, Glu | |
| | C101 | Asp, Cys, His, Asn | |
| | C115, C121 | Gln, Cys, His | |
| CD | C11, C13, C14, C25, C50, C80 | His, Cys | |
| CE | C5, C55 | His, Glu, Gln, Cys | |
| | C48, C57 | His, Asp, Gln, Cys | |
| | C63 | His, Asn, Gln, Cys | |
| | C79 | His, Cys | |
| CF | C15 | Glu, Cys, His | |
| CL | C60, C82A | His, Cys | |
| | C82B | His, Gly, Cys, Arg | |
| CM | C18 | His, Glu, Cys | |
| CN | C9 | Cys, His | |
| CO | C40 | Cys, His, His | C40.002, C40.007, C40.010: Cys, His, Asn; C40.005, C40.013: Cys, His, Gln; C40.011, C40.012: Cys, His, Glu |
| CP | C97 | His, Cys | |
| CQ | C53 | His, Cys | |
| CR | C108 | His, Cys | |
| GA | G1 | Gln, Glu | |
| GB | G2 | Asp, Glu | |
| MA(D) | M6, M7, M35, M41, M64 | (His), Glu, (His), (Asp) | |
| MA(E) | M1, M34 | (His), Glu, (His), (Glu), Tyr | |
| | M2, M3, M5, M9, M30, M32, M41, M48, M49, M56, M61, M78, M90, M93, M98, M100, M101 | (His), Glu, (His), (Glu) | |
| | M4 | (His), Glu, (His), (Glu), His | |
| | M13 | (His), Glu, (His), (Glu), Asp | |
| | M26, M36, M60, M76, M91, M102 | (His), Glu, (His) | |
| | M27 | (His), Glu, (His), (Glu), Arg, Tyr | |
| | M85 | (His), Glu, (His), (Asp), (Tyr) | |
| | M95 | (His), Glu, (His), (His) | |
| MA(M) | M8, M43B | (His), Glu, (His), (His), Met | |
| | M10A, M10C, M12, M43A, M54, M57, M66, M72, M80, M84, M97 | (His), Glu, (His), (His) | M54.003: (His), Glu, (His), (Asn) |
| | M10B | (His), Glu, (His), (His), Tyr | |
| | M11 | (His), Glu, (His), (His) | M11.002: (Gln), Glu, (His), (His) |
| MC | M14A, M14B, M14D, M99 | (His), (Glu), Arg, (His), Glu | M14.027, M14.036: (His), (Glu), Arg, (His) |
| | M14C, M86 | (His), (Glu), (His), Glu | |
| MD | M15A, M74 | (His), (Asp), His, (His) | |
| | M15B | (His), (Asp), Glu, (His) | |
| | M15C | (His), (Asp), Asp, (His) | |
| ME | M16 | (His), Glu, (His), Glu, (Glu) | |
| | M44 | (His), Glu, (His), (Glu) | |
| MF | M17 | (Lys), (Asp), Lys, (Asp), (Asp), (Glu), Arg | |
| MG | M24A | His, (Asp), (Asp), (His), (Glu), (Glu) | |
| | M24B | His, (Asp), (Asp), His, (His), His, (Glu), (Glu) | |
| MH | M18, M20A, M20B, M20F, M28, M42 | (His), Asp, (Asp), Glu, (Glu), (Asp), (His) | |
| | M20D | (Asp), Asp, (Asp), Glu, (Glu), (His), (His) | M20.019: (Glu), Asp, (His), Glu, (Glu), (His), (His) |
| MJ | M19 | (His), (Asp), (Glu), (His), (His) | M19.004: (His), (Asp), (Glu), (Tyr), (His) |
| | M38 | (His), (His), (Lys), (His), (His), Asp | M38.002: (His), (His), (Lys), (Thr), (His), Asp |
| MM | M50 | (His), Glu, (His), (Asp) | |
| MN | M55 | (Asp), (Glu), (His), (His), His, (Glu) | |
| MO | M23 | (His), (Asp), His, (His) | |
| MP | M67 | Glu, (His), (His), (Asp) | |
| MQ | M29 | (Glu), (Glu), (His), Tyr, (His), (Asp) | |
| MS | M75 | | |
| MT | M81 | (Asp), (His), (His) | |
| MU | M103 | (His), Glu, (His), (Cys) | |
| NA | N1 | Asp, Asn | |
| | N2 | Glu, Asn | |
| | N8 | Asn | |
| NB | N6 | Asn | |
| NC | N7 | Asn | |
| ND | N4 | Asn, Tyr, Glu, Arg | |
| NE | N5 | Asn | |
| PA(C) | C3A, C3D, C3H, C74 | His, Glu, Cys | |
| | C3B, C3C, C3E, C3F, C3G, C4, C24, C99 | His, Asp, Cys | C24.002: His, Glu, Cys |
| | C30, C37, C62, C107 | His, Cys | |
| PA(S) | S1, S3, S6, S7, S29, S30, S31, S32, S39, S46, S55, S64, S75 | His, Asp, Ser | |
| | S65 | His, Glu, Ser | |
| PB(C) | C44, C45, C59, C69 | Cys | |
| | C89 | C, R, D | |
| PB(S) | P2A, S45 | Ser | P2B: Thr |
| PB(T) | T1, T2, T3, T7 | Thr | T03.022: Ser |
| PC(C) | C26 | Cys, His, Glu | |
| | C56 | Glu, Cys, His | |
| PC(S) | S51 | Ser, His, Glu | |
| PD(C) | C46 | Cys, Thr, His | |
| PD(N) | N9, N10 | Cys, Asn, Cys | N10.001: Cys, Asn, Thr; N10.002: Cys, Asn, Ser; N10.005: Cys, Gln, Cys; N10.007: Ser, Asn, Ser |
| | N11 | Cys, Asn | |
| PE | P1 | Ser | P01.101: Thr; P01.102: Thr |
| | T5 | Thr | |
| SB | S8 | Asp, His, Asn, Ser | S08.073: Asp, His, Asp, Ser; S08.109: Asp, His, Asp, Ser |
| | S53 | Glu, Asp, Asp, Ser | |
| SC | S9, S10, S15, S28, S33, S37, S82 | Ser, Asp, His | |
| SE | S11, S13 | Ser, Lys, Ser | |
| | S12 | Ser, Lys, Tyr | |
| SF | S24, S26A, S26C | Ser, Lys | |
| | S26B | Ser, His | |
| SH | S21 | His, Ser, His | |
| | S73 | Asp, His, Ser | |
| | S78, S80 | His, Ser | |
| SJ | S16, S50, S69 | Ser, Lys | |
| SK | S14 | Ser, His, Asp | |
| | S41A, S49C | Ser, Lys | |
| | S41B | Ser, His, Ser, Glu | |
| | S49A | Lys, Ser, Ser | |
| | S49B | Ser, Ser, Lys | |
| SO | S74 | Ser, Lys | |
| SP | S59 | His, Ser | |
| SR | S60 | Lys, Ser | |
| SS | S66 | Ser, Glu, His | |
| ST | S54 | Ser, His | |
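
The column heading of Table 5, "Active site residues (metal ligands)", suggests that a residue written in parentheses is a metal ligand and an unbracketed residue is a catalytic residue. A minimal Python sketch under that assumption splits an entry into residues and flags the ligands; the function name and return format are hypothetical and chosen only for illustration.

```python
def parse_residues(entry):
    """Split a Table 5 entry into (residue, is_metal_ligand) pairs.

    Assumes residues are comma-separated and that a residue written in
    parentheses, for example "(His)", denotes a metal ligand.
    """
    parsed = []
    for token in entry.split(","):
        token = token.strip()
        if not token:
            continue
        is_ligand = token.startswith("(") and token.endswith(")")
        name = token[1:-1] if is_ligand else token
        parsed.append((name, is_ligand))
    return parsed


# Example entries copied from Table 5 (families M4 and S8).
m4 = parse_residues("(His), Glu, (His), (Glu), His")
s8 = parse_residues("Asp, His, Asn, Ser")

# Residues flagged as metal ligands in the M4 entry.
m4_ligands = [name for name, is_ligand in m4 if is_ligand]
print(m4_ligands)
print(s8)
```

Free-text entries such as "Asp (dimer)" or alternatives such as "Asp/Asn" are outside the scope of this sketch and would need separate handling.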

Table 6. Peptidases from model organisms

| Organism | MEROPS identifiers | Asp | Glu | Cys | Met | Ser | Thr | Mix | APL | Unk | Hol | Unc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Human | 584 | 31 | 0 | 158 | 166 | 181 | 20 | 31 | 0 | 0 | 470 | 0 |
| Mouse | 606 | 28 | 0 | 164 | 164 | 216 | 12 | 22 | 0 | 0 | 222 | 26 |
| Danio rerio | 343 | 11 | 0 | 108 | 96 | 99 | 8 | 21 | 0 | 0 | 28 | 26 |
| Drosophila melanogaster | 463 | 21 | 0 | 67 | 119 | 241 | 14 | 1 | 0 | 0 | 408 | 344 |
| Caenorhabditis elegans | 359 | 27 | 0 | 97 | 129 | 94 | 11 | 1 | 0 | 0 | 320 | 285 |
| Arabidopsis thaliana | 564 | 74 | 0 | 126 | 82 | 271 | 11 | 0 | 0 | 0 | 517 | 446 |
| Saccharomyces cerevisiae | 112 | 12 | 0 | 34 | 35 | 23 | 7 | 0 | 1 | 0 | 100 | 23 |
| Schizosaccharomyces pombe | 83 | 5 | 0 | 27 | 24 | 19 | 8 | 0 | 0 | 0 | 57 | 47 |
| Dictyostelium discoideum | 141 | 10 | 0 | 47 | 37 | 40 | 7 | 0 | 0 | 0 | 98 | 102 |
| Plasmodium falciparum | 85 | 11 | 0 | 33 | 18 | 17 | 6 | 0 | 0 | 0 | 73 | 48 |
| Escherichia coli | 148 | 12 | 0 | 19 | 46 | 56 | 3 | 0 | 5 | 7 | 112 | 35 |
| Bacillus subtilis | 157 | 4 | 1 | 26 | 56 | 61 | 5 | 0 | 1 | 3 | 91 | 50 |
| Pyrococcus furiosus | 48 | 2 | 0 | 7 | 22 | 12 | 2 | 0 | 1 | 0 | 26 | 17 |

Highlights


Proteolytic enzymes have been classified into clans by similarity of tertiary structure and into families by similarity of sequence.

A number of criteria have been devised to differentiate paralogous enzymes within each family, and each enzyme has been assigned a unique identifier.

The numbers of clans, families, distinct enzymes and active site mechanisms have increased significantly since the original MEROPS classification in 1993.




Figure 1

Figure 2