MemSTATS: A Benchmark Set of Membrane Protein Symmetries and Pseudosymmetries


Journal Pre-proof

MemSTATS: A Benchmark Set of Membrane Protein Symmetries and Pseudo-Symmetries

Antoniya A. Aleksandrova, Edoardo Sarti, Lucy R. Forrest

PII: S0022-2836(19)30575-3

DOI: https://doi.org/10.1016/j.jmb.2019.09.020

Reference: YJMBI 66279

To appear in: Journal of Molecular Biology

Received Date: 29 May 2019

Revised Date: 30 August 2019

Accepted Date: 23 September 2019

Please cite this article as: A.A. Aleksandrova, E. Sarti, L.R. Forrest, MemSTATS: A Benchmark Set of Membrane Protein Symmetries and Pseudo-Symmetries, Journal of Molecular Biology, https://doi.org/10.1016/j.jmb.2019.09.020. This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Published by Elsevier Ltd.

Credit Author Statement

Antoniya Aleksandrova: Conceptualization, Methodology, Data curation, Writing - Original Draft, Software, Visualization, Formal Analysis, Investigation. Edoardo Sarti: Conceptualization, Writing - Reviewing and Editing. Lucy Forrest: Conceptualization, Supervision, Writing - Reviewing and Editing, Resources, Project Administration, Funding Acquisition.

[Graphical abstract. Labels: MemSTATS Membrane Protein Symmetry Data Set; AnAnaS; QuatSymm; CE-Symm; SymD]

MemSTATS: A Benchmark Set of Membrane Protein Symmetries and Pseudo-Symmetries

Antoniya A. Aleksandrova, Edoardo Sarti*, Lucy R. Forrest

Computational Structural Biology Section, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD 20892, USA.

*Present address: Laboratoire de Biologie Computationnelle et Quantitative, CNRS - Sorbonne Université, Institut de Biologie Paris-Seine, Case courrier 1540, 4 Place Jussieu, Paris 75005, France

Correspondence to Lucy R. Forrest: Computational Structural Biology Section, National Institute of Neurological Disorders and Stroke, National Institutes of Health, 35 Convent Drive, Rm 3D991 MSC 3761, Bethesda, MD 20892-3761, USA. Tel. +1-301-402-2012. Fax +1-301-480-1720.


Abstract

In membrane proteins, symmetry and pseudo-symmetry often have functional or evolutionary implications. However, available symmetry detection methods have not been tested systematically on this class of proteins due to the lack of an appropriate benchmark set. Here we present MemSTATS, a publicly-available benchmark set of both quaternary and internal symmetries in membrane protein structures. The symmetries are described in terms of order, repeated elements, and orientation of the axis with respect to the membrane plane. Moreover, using MemSTATS, we compare the performance of four widely-used symmetry detection algorithms and highlight specific challenges and areas for improvement in the future.

Keywords

Evolution; Duplication; Asymmetry; Quaternary symmetry; Internal symmetry

Abbreviations

MemSTATS, Membrane protein Structures And Their Symmetries


Membrane proteins are encoded by around one third of a given genome [1–3] and play key roles in transmission of information and chemicals such as neurotransmitters into the cell. Available membrane protein structures have revealed an abundance of symmetry and pseudo-symmetry [4–7]. Even though the constraints of the two-dimensional lipid bilayer limit the range of folds available to membrane proteins [8,9], the increased likelihood of complexes encountering each other in the concentrated environment of the membrane favors formation of oligomers [9,10], which, from an energetic standpoint, are more likely to be symmetric than not [11]. In addition to multi-subunit assemblies, membrane proteins also commonly contain repetitions of internal structural elements [4,5]. As for water-soluble proteins, internal symmetries are thought to have arisen primarily from intragenic duplication and recombination events and to confer specific advantages such as structural stability, foldability, enhanced complexity, and increased surface area for interaction with ligands and other proteins [7,12,13]. Indeed, symmetry is often intimately associated with the function of membrane proteins, including permeation, transport, signaling, and regulation [6,12,14,15]. The consideration of repeated elements or symmetry has been used to aid in structure prediction [16–18], protein design [19], and mechanistic [20,21] and evolutionary studies [22–24] of membrane proteins. Most of these studies have relied on analyses of individual membrane protein structures. However, the current pseudo-exponential growth in the number of structures of membrane proteins [25] calls for more automated and reproducible approaches.

A number of sophisticated algorithms exist that detect internal repeated structural elements in proteins. Many of these methods were designed specifically to identify arrays of repeats shorter than 80 residues, known as tandem repeats [26], including RAPHAEL [27], TAPO [28], ProSTRIP [29], ConSole [30], ReUPred [31] and RepeatsDB-lite [32]. While some transmembrane β-barrels can be considered “closed” tandem-repeat proteins [26], α-helical membrane proteins, the largest and most diverse subgroup of membrane proteins, do not belong in the class of tandem repeats, due to their compactness. Methods to detect larger and more compact structural repeats typically use self-alignment approaches and include a tiling approach [33], DAVROS [34], OPAAS [35–38], GANGSTA+ [39] and PRIGSA [40]. Notably, only two methods have been designed specifically to detect internal symmetry relationships, rather than only the presence of structurally related repeat elements: SymD [41] and CE-Symm [5,42].

There has been limited effort to investigate the applicability of these repeat- or symmetry-detection methods to membrane proteins [5]. The performance of these algorithms for membrane proteins might differ for several reasons. First, the constraints due to the lipid bilayer are expected to limit the types of symmetries available. Second, the sequences of membrane proteins have diverged extensively [15,43], which can hinder the detection of internal symmetry. Third, the structures of some membrane proteins exhibit functional asymmetry [6], which might necessitate detection of multiple small, non-hierarchical symmetries to fully capture the relevant characteristics. Assessing the applicability of existing symmetry detection algorithms to membrane proteins, however, is currently hampered by the lack of a reference benchmark that can be used to evaluate the performance of the detection methods. Although the Protein Data Bank (PDB) [44] reports the presence of quaternary symmetry [45], internal symmetries are not reported (see description of QuatSymm below). Previous efforts to curate benchmark sets for structural symmetries have either excluded membrane proteins entirely [41] or have included very few cases in which the symmetry is present in membrane-embedded segments [5,42,46]. In a manually curated dataset of 1007 symmetries within protein domains [5,42], for example, only 25 of the symmetry entries are membrane-embedded. The tandem repeats database RepeatsDB contains relevant entries for 32 distinct transmembrane β-barrels, but none for α-helical membrane proteins [46]. Finally, none of the aforementioned datasets specify the orientation of the symmetry axis relative to the membrane plane.


To understand how well-suited the available methods are for detecting membrane protein symmetries and to facilitate future investigations of the effect of the lipid bilayer on a protein’s predisposition for symmetry, we compiled a benchmark set of Membrane protein Structures And Their Symmetries (MemSTATS). We chose to describe membrane proteins in terms of complexes and their membrane-spanning chains rather than domains. While many reputable databases and benchmarks use domains as the fundamental unit [47,48], the definition of domains in membrane proteins is often ambiguous [49,50]. Instead, we organize our data around protein chains, each being a single gene product [51], which allows a natural categorization of both internal and quaternary symmetries in every entry of the dataset.

While compiling the MemSTATS set, we aimed to describe the diversity of membrane protein symmetries, while limiting the number of trivial cases such as a repeat composed of a single transmembrane helix. We started from the set of protein architectures and symmetries described in [6] and refined and expanded this set to create MemSTATS. Symmetries in protein chains were included only if the chain shared no more than 85% sequence identity and a TM-score of no more than 0.65, from FrTM-align [52] superposition, with any of the other chains in the data set. Quaternary symmetries were described either if a chain from the complex was already included in the data set or if its overall architecture was distinct from the other complexes. The resultant set comprises 87 α-helical and 23 β-barrel membrane proteins, each with distinct architecture either at the quaternary or internal level.
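The inclusion criterion for chains can be illustrated with a minimal sketch. It assumes that pairwise sequence identities and TM-scores have already been computed elsewhere (e.g., by an alignment program) and are supplied as callables. The greedy, order-dependent selection, the function names and the toy values are assumptions made for illustration and are not taken from the MemSTATS curation code; only the two cutoffs come from the text.

```python
from typing import Callable, Dict, FrozenSet, List

SEQ_ID_MAX = 0.85    # cutoff stated in the text (85% sequence identity)
TM_SCORE_MAX = 0.65  # cutoff stated in the text (TM-score)

PairScore = Callable[[str, str], float]


def select_nonredundant(chains: List[str],
                        seq_identity: PairScore,
                        tm_score: PairScore) -> List[str]:
    """Keep a chain only if it stays within both cutoffs against every kept chain.

    Greedy selection in input order is an illustrative assumption.
    """
    selected: List[str] = []
    for chain in chains:
        if all(seq_identity(chain, kept) <= SEQ_ID_MAX
               and tm_score(chain, kept) <= TM_SCORE_MAX
               for kept in selected):
            selected.append(chain)
    return selected


def lookup(table: Dict[FrozenSet[str], float], default: float) -> PairScore:
    """Turn a symmetric table of pairwise scores into a callable."""
    return lambda a, b: table.get(frozenset((a, b)), default)


# Hypothetical chains and scores, used only to exercise the filter.
toy_identity = {frozenset(("chainA", "chainB")): 0.95,
                frozenset(("chainA", "chainC")): 0.20}
toy_tm = {frozenset(("chainA", "chainB")): 0.90,
          frozenset(("chainA", "chainC")): 0.40}

kept = select_nonredundant(["chainA", "chainB", "chainC"],
                           lookup(toy_identity, 0.0),
                           lookup(toy_tm, 0.0))
print(kept)
```

In this toy input, chainB is rejected because it is too similar to chainA on both measures, whereas chainC passes both cutoffs.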

For each symmetry, the following features are specified: symmetry order; axis orientation with respect to the membrane; approximate amino-acid range of the repeats; whether the symmetry is cyclic or open (as in helical or linear symmetries); whether the symmetry has been explicitly mentioned in the literature; and whether the repeats are interdigitating (https://doi.org/10.5281/zenodo.1345121). Unlike the available CE-Symm symmetry benchmark set, which included only the symmetry order and the symmetry group [5], the more extensive descriptions in MemSTATS make it possible to assess not only whether an algorithm detects a symmetry, but also how precisely the symmetry is characterized. The composition of the MemSTATS data set is shown in Table S1 and Fig. S1, which highlight the frequency of each type of symmetry and the diversity of folds characterized by the number of transmembrane segments.
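As an illustration of how such annotations allow the precision of a prediction to be scored, the sketch below stores the listed features in a simple record and computes the fraction of annotated repeat residues recovered by a prediction. The class, field names, coverage measure and all values are hypothetical; they do not reflect the actual file format of the MemSTATS release or the scoring used in the benchmark.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

ResidueRange = Tuple[int, int]  # inclusive start and end residue numbers


@dataclass
class SymmetryEntry:
    """Hypothetical record holding the features listed in the text."""
    structure_id: str
    order: int
    axis_orientation: str          # e.g. "parallel" or "perpendicular" to the membrane
    repeats: List[ResidueRange]    # approximate residue range of each repeat
    is_cyclic: bool                # False for open (helical or linear) symmetries
    in_literature: bool
    interdigitating: bool


def residues(ranges: List[ResidueRange]) -> Set[int]:
    """Expand inclusive residue ranges into a set of residue numbers."""
    covered: Set[int] = set()
    for start, end in ranges:
        covered.update(range(start, end + 1))
    return covered


def repeat_coverage(entry: SymmetryEntry, predicted: List[ResidueRange]) -> float:
    """Fraction of annotated repeat residues also present in the prediction."""
    annotated = residues(entry.repeats)
    if not annotated:
        return 0.0
    return len(annotated & residues(predicted)) / len(annotated)


# Made-up entry and prediction, for illustration only.
example = SymmetryEntry("XXXX_A", 2, "parallel",
                        [(10, 109), (120, 219)], True, True, False)
predicted_repeats = [(10, 59), (120, 169)]
coverage = repeat_coverage(example, predicted_repeats)
order_matches = (example.order == len(predicted_repeats))
print(coverage, order_matches)
```

Here the prediction reports the correct order but recovers only half of the annotated repeat residues, a distinction that a benchmark listing only the order and symmetry group could not capture.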

Using MemSTATS, we have assessed the performance of two well-established computational algorithms for detecting internal symmetry in protein structures, namely SymD v1.61 [41] and CE-Symm v2.0 RC3 [5,42], as well as two methods designed exclusively for quaternary symmetry detection, namely AnAnaS v0.8 [53,54] and QuatSymm v2.1.0, which is used for symmetry annotations of complexes in the PDB [45] (Fig. 1, 2, 3).

The two quaternary-symmetry detection methods, AnAnaS and QuatSymm, differ significantly in their approach to detecting the point group and symmetry axes for a complex. However, both methods strive to optimize the speed of repeat detection by taking advantage of the fact that in quaternary symmetry there is often a correspondence between repeat and chain boundaries. Therefore, to identify symmetry-related subunits, the two algorithms rely on sequence alignment and structural superposition of each pair of chains. While this strategy makes them suitable for efficiently scanning a large number of protein complexes, it also means that, by design, AnAnaS and QuatSymm cannot detect symmetries within a protein chain.

On the other hand, SymD and CE-Symm both specifically target internal symmetries. To identify structurally similar regions of the protein chain, these two algorithms align the protein structure to itself. For example, SymD aligns two copies of the same protein, circularly permutes the second copy, one residue at a time, and scores the resulting structural superpositions. This approach allows SymD to deduce the order and axis of symmetry, as well as the number of residues in the protein that are symmetry-related, but not the boundaries of the repeats. In contrast, CE-Symm uses dynamic programming to align small fragments of the protein and then applies a series of thresholds (e.g., of fragment size) to identify a global structural self-alignment from which both symmetry axes and repeat boundaries can be extracted. Designed with internal symmetry in mind, both SymD and CE-Symm implicitly assume that connected, repeated elements tend to be not only sequentially but also spatially proximate. This assumption, however, can make it challenging to use SymD or CE-Symm for detecting the order of quaternary symmetry in complexes with more than three symmetry-related repeats, since all possible arrangements of the repeats need to be taken into account. Therefore, SymD and CE-Symm, which process complexes as a single polypeptide composed of the chains concatenated in the order of their appearance in the PDB file, struggle to detect symmetry in large complexes in which the names of the chains do not correspond to their spatial proximity (e.g., PDB: 1K4C; Fig. 1E). Note that AnAnaS and QuatSymm can both correctly characterize such cases.
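The effect of chain order can be illustrated with a toy calculation. It is not a description of the internals of SymD or CE-Symm. Four subunits are assumed to sit on a ring at hypothetical angular positions around the symmetry axis, and the sketch asks whether stepping from each chain to the next one in file order always corresponds to the same rotation.

```python
from typing import List


def successive_rotations(angles_deg: List[float]) -> List[float]:
    """Rotation (in degrees, modulo 360) from each chain to the next in file order."""
    n = len(angles_deg)
    return [(angles_deg[(i + 1) % n] - angles_deg[i]) % 360.0 for i in range(n)]


def single_rotation_relates_neighbors(angles_deg: List[float],
                                      tol: float = 1e-6) -> bool:
    """True if one and the same rotation maps every chain onto its sequence neighbor."""
    steps = successive_rotations(angles_deg)
    return all(abs(step - steps[0]) < tol for step in steps)


# Hypothetical angular positions of four chains around a four-fold axis.
spatial_order = [0.0, 90.0, 180.0, 270.0]   # file order follows the ring
shuffled_order = [0.0, 180.0, 90.0, 270.0]  # file order jumps across the ring

print(successive_rotations(spatial_order),
      single_rotation_relates_neighbors(spatial_order))
print(successive_rotations(shuffled_order),
      single_rotation_relates_neighbors(shuffled_order))
```

When the file order follows the ring, each step is the same 90° rotation. When it does not, the steps differ, so a method that expects sequentially adjacent repeats to be related by one operation has to consider alternative arrangements of the chains.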

While MemSTATS allows for the four algorithms to be benchmarked against each other, it should be noted that the set is relatively small due to the current availability of structures and the emphasis on avoiding redundancy in the types of challenges presented. Therefore, overall metrics for scoring should be carefully selected to take into account the size and imbalance in the dataset. With this in mind, we summarized the key results for each method using confusion matrices (Fig. 1A, 2A) and also compared the probability that each algorithm produces informed predictions relative to chance (Fig. 3). Overall, these results indicate that while most methods detected more symmetries than they missed, a substantial fraction of symmetries were not detected, especially for cases of internal symmetry. In addition, it is clear that identifying the boundaries of internal repeats is a non-trivial problem (Fig. 3). To better understand the specific situations in which each method performs well or poorly, we used MemSTATS to examine individual examples for each method (Fig. 1B-D, 2B-C), aside from the sensitivity to chain labelling order mentioned above (Fig. 1E).
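To make concrete why chance-corrected measures are preferred for an imbalanced set, the sketch below computes accuracy and informedness from the four cells of a confusion matrix. It assumes the standard definition of informedness as sensitivity plus specificity minus one (see [59]); the counts are invented for illustration and are not results from the benchmark.

```python
def accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Fraction of all cases that were classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


def informedness(tp: int, fn: int, tn: int, fp: int) -> float:
    """Sensitivity + specificity - 1; zero means no better than chance."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


# Invented counts: 90 symmetric and 10 asymmetric cases, scored for a
# trivial predictor that labels every structure as symmetric.
always_symmetric = dict(tp=90, fn=0, tn=0, fp=10)

acc = accuracy(**always_symmetric)
inf = informedness(**always_symmetric)
print(acc, inf)
```

For these invented counts the trivial predictor reaches an accuracy of 0.9 but an informedness of 0, showing how accuracy alone can reward uninformed predictions when one class dominates.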


AnAnaS, for example, due to its analytic framework, is limited to global cyclic, dihedral, or cubic symmetries, and therefore did not detect either open symmetries, such as helical or linear symmetries (e.g., PDB: 4HEA; Fig. 1B), or local symmetries, in which only a subset of chains in a complex are related (e.g., PDB: 3WMM, PDB: 2R6G, and PDB: 4Y28). All other methods can, in principle, detect such local and open symmetries. QuatSymm, on the other hand, failed for cases where the symmetry-related chains fell below the assigned threshold of 40% sequence identity (e.g., PDB: 1PRC, see Fig. 1C, PDB: 1ZOY, PDB: 2R6G, PDB: 4Y28, and PDB: 4HEA). Such cases were typically correctly identified by the internal symmetry algorithms, as long as the repeats constituted a large fraction of the complex (e.g., PDB: 2R6G but not PDB: 4HEA).
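The consequence of a fixed sequence identity cutoff can be shown with a small sketch. Only the 40% threshold comes from the text. The identity definition used here (identical residues over aligned, gap-free columns), the function names and the toy sequences are assumptions, and the actual grouping of chains in QuatSymm may differ.

```python
IDENTITY_THRESHOLD = 40.0  # percent, as stated in the text


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identical residues over aligned columns without gaps.

    Both inputs must come from the same pairwise alignment ('-' marks a gap).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


def treated_as_equivalent(aligned_a: str, aligned_b: str) -> bool:
    """True if the pair reaches the identity threshold."""
    return percent_identity(aligned_a, aligned_b) >= IDENTITY_THRESHOLD


# Toy pre-aligned sequences (hypothetical, not taken from any structure).
reference = "ACDEFGHIKL"
close_homolog = "ACDQFGHWKL"
diverged_repeat = "AWYQMGPTRL"

print(percent_identity(reference, close_homolog),
      treated_as_equivalent(reference, close_homolog))
print(percent_identity(reference, diverged_repeat),
      treated_as_equivalent(reference, diverged_repeat))
```

Under such a criterion, a pair of chains at roughly 22% identity, like chains L and M of 1PRC, would not be considered equivalent regardless of how well the structures superpose.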

The ability to detect open symmetry makes internal symmetry algorithms vulnerable to obtaining a well-scoring self-alignment by just a slight translation of the structure. This so-called “slip” symmetry matches most secondary structure elements from the transformed structure to themselves and, hence, proteins with an abundance of straight parallel α-helices or β-strands, for example, were especially prone to eliciting this behavior (Fig. 2C). While the various thresholds assigned in SymD and CE-Symm attempt to mitigate the detection of false positives due to slip symmetry, the problem persisted for SymD in a number of cases (e.g., PDB: 4HEA and PDB: 2ZW3 chain A; Fig. 2C).
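One simple way to recognize a slip-like transformation after the fact is to inspect the rigid-body operation returned by a self-alignment: a slip is close to a pure translation, so its rotation angle is near zero. The sketch below derives the rotation angle from the trace of a 3×3 rotation matrix. The function names and the 10° cutoff are hypothetical and do not correspond to the filters implemented in SymD or CE-Symm.

```python
import math
from typing import Sequence

Matrix3 = Sequence[Sequence[float]]


def rotation_angle_deg(rotation: Matrix3) -> float:
    """Rotation angle of a 3x3 rotation matrix, from trace = 1 + 2*cos(angle)."""
    trace = rotation[0][0] + rotation[1][1] + rotation[2][2]
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # guard against rounding
    return math.degrees(math.acos(cos_angle))


def looks_like_slip(rotation: Matrix3,
                    translation: Sequence[float],
                    max_angle_deg: float = 10.0) -> bool:
    """Flag a transformation with almost no rotation but a non-zero shift.

    The angle cutoff is an illustrative assumption.
    """
    shift = math.sqrt(sum(component ** 2 for component in translation))
    return rotation_angle_deg(rotation) < max_angle_deg and shift > 0.0


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
two_fold_about_z = [[-1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 1.0]]

print(looks_like_slip(identity, (0.0, 0.0, 5.0)))          # pure translation
print(looks_like_slip(two_fold_about_z, (0.0, 0.0, 0.0)))  # genuine two-fold rotation
```

A check of this kind would also flag genuine open symmetries whose rotation component is very small, so it could complement, but not replace, the score thresholds mentioned above.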

Cases with symmetry mismatch proved to be inscrutable to all the abovementioned algorithms. Such cases included the structures of the AMPA (PDB: 4U1W) or NMDA (PDB: 4PE5) receptors, which besides the readily-detectable global two-fold symmetry of the complex, also exhibit two unrelated and distinct pseudo-symmetries relating the chain fragments in the membrane-embedded and extracellular components of the complex (Fig. 2 in [55]). These symmetries presented a challenge for AnAnaS and QuatSymm because neither method considers repeats formed by a fraction of a chain. Among the internal-symmetry detection methods, SymD is inherently unable to detect more than one symmetry in a structure, while CE-Symm can only detect multiple symmetries simultaneously if those symmetries are hierarchical, i.e., they relate sub-regions of the same symmetric repeats. These limitations of the internal-symmetry detection algorithms also prevented them from identifying two distinct symmetries within a single protein chain that cannot be related to one another, such as those found in the protomer of a glutamate transporter homolog (GltPh, PDB: 2NWX_A; Fig. 2B).

One of the main challenges for internal symmetry algorithms is detecting the full extent of the repeats. First, since there is often much more sequence and structure divergence between repeats in internal symmetry than in quaternary symmetries, setting appropriate thresholds for determining whether a protein structure is symmetric and which residues are involved can be difficult. This issue was apparent for protein structures in which a pronounced structural divergence between repeats is of functional significance, such as in vcINDY (PDB: 4F35 chain C) [56,57]. Second, CE-Symm, due to its focus on detecting multiple, hierarchically-related symmetries within a structure, might only identify some of the residues of the repeats related by a given symmetry axis because the remainder do not satisfy the criteria of a second symmetry axis. Such an outcome was observed not only within protein chains (e.g. PDB: 3NCY chain A), but also for complexes that exhibit both quaternary and internal symmetry (e.g. PDB: 1OTS; Fig. 1D).

In conclusion, the MemSTATS data set provides a solid foundation for evaluating both the overall performance of symmetry detection algorithms and the specific challenges in membrane protein structures that each method can address. Moreover, since the data set was carefully curated to represent all types of symmetries specific to membrane-spanning protein chains, such as interdigitating repeats and different membrane orientations of the symmetry axes, we hope that it will prove a useful reference for studies of other aspects of membrane protein biology, such as folding and evolution. Nevertheless, the limitations of the data set, including its relatively small size, should be kept in mind when interpreting the statistical significance of results obtained thereon. For example, at the quaternary level, more than one complex within the dataset has similar architecture (e.g., there are multiple examples with C4 quaternary symmetry, see Table S1 and Fig. S1), and therefore any issues that this specific architecture presents will be upweighted when judging the overall performance of an algorithm. Furthermore, only a few multi-chain proteins have no quaternary symmetry, while few β-barrel proteins have no internal symmetry; this imbalance means that measures such as accuracy should be avoided or used with care [58–60]. We also note that the determination of internal structural symmetry in β-barrels is highly ambiguous, because it is not fully apparent how the symmetry might relate to function; these cases were included in the dataset mainly for completeness. The abovementioned issues are unlikely to be solved by the increasing availability of membrane protein structures since it is likely that they are intrinsic to the types of symmetries that can be encountered within the membrane environment. Nevertheless, it is likely that the discovery of new architectures will pose novel and interesting challenges to symmetry detection algorithms.


Acknowledgements

This research was supported by the Division of Intramural Research of the NIH, National Institute of Neurological Disorders and Stroke. We thank the LOBOS administrative team of the National Heart, Lung and Blood Institute for computational support.

Appendix A. Supplementary data

The current release of the MemSTATS data set is available online at: https://doi.org/10.5281/zenodo.1345121. The dynamic version of the data set, as well as the Python code used to benchmark the results of the symmetry detection methods, is available online at: https://github.com/AntoniyaAleksandrova/MemSTATS_benchmark. The raw outputs of the benchmarked symmetry-detection algorithms can be found at https://doi.org/10.5281/zenodo.3228540.

References

[1] T.J. Stevens, I.T. Arkin, Do more complex organisms have a greater proportion of membrane proteins in their genomes?, Proteins Struct. Funct. Genet. 39 (2000) 417–420. doi:10.1002/(SICI)1097-0134(20000601)39:4<417::AID-PROT140>3.0.CO;2-Y.

[2] A. Krogh, B. Larsson, G. von Heijne, E.L.L. Sonnhammer, Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes, J. Mol. Biol. 305 (2001) 567–580. doi:10.1006/jmbi.2000.4315.

[3] T. Nugent, D.T. Jones, Transmembrane protein topology prediction using support vector machines, BMC Bioinformatics. 10 (2009) 159. doi:10.1186/1471-2105-10-159.

[4] S. Choi, J. Jeon, J.-S. Yang, S. Kim, Common occurrence of internal repeat symmetry in membrane proteins, Proteins Struct. Funct. Bioinf. 71 (2008) 68–80. doi:10.1002/prot.21656.

[5] D. Myers-Turnbull, S.E. Bliven, P.W. Rose, Z.K. Aziz, P. Youkharibache, P.E. Bourne, A. Prlić, Systematic detection of internal symmetry in proteins using CE-Symm, J. Mol. Biol. 426 (2014) 2255–2268. doi:10.1016/j.jmb.2014.03.010.

[6] L.R. Forrest, Structural symmetry in membrane proteins, Annu. Rev. Biophys. 44 (2015) 311–337. doi:10.1146/annurev-biophys-051013-023008.

[7] S. Balaji, Internal symmetry in protein structures: prevalence, functional relevance and evolution, Curr. Opin. Struct. Biol. 32 (2015) 156–166. doi:10.1016/j.sbi.2015.05.004.

[8] A. Oberai, Y. Ihm, S. Kim, J.U. Bowie, A limited universe of membrane protein families and folds, Protein Sci. 15 (2006) 1723–1734. doi:10.1110/ps.062109706.

[9] I.D. Pogozheva, H.I. Mosberg, A.L. Lomize, Life at the border: adaptation of proteins to anisotropic membrane environment, Protein Sci. 23 (2014) 1165–1196. doi:10.1002/pro.2508.

[10] I.D. Pogozheva, S. Tristram-Nagle, H.I. Mosberg, A.L. Lomize, Structural adaptations of proteins to different biological membranes, Biochim. Biophys. Acta - Biomembr. (2013). doi:10.1016/j.bbamem.2013.06.023.

[11] I. André, C.E.M. Strauss, D.B. Kaplan, P. Bradley, D. Baker, Emergence of symmetry in homooligomeric biological assemblies, Proc. Natl. Acad. Sci. U. S. A. 105 (2008) 16148–16152. doi:10.1073/pnas.0807576105.

[12] D.S. Goodsell, A.J. Olson, Structural symmetry and protein function, Annu. Rev. Biophys. Biomol. Struct. 29 (2000) 105–153. doi:10.1146/annurev.biophys.29.1.105.

[13] P.G. Wolynes, Symmetry and the energy landscapes of biomolecules, Proc. Natl. Acad. Sci. U. S. A. 93 (1996) 14249–14255. doi:10.1073/pnas.93.25.14249.

[14] G. Maksay, O. Toke, Asymmetric perturbations of signalling oligomers, Prog. Biophys. Mol. Biol. (2014). doi:10.1016/j.pbiomolbio.2014.03.001.

[15] A.M. Duran, J. Meiler, Inverted topologies in a membrane: a mini-review, Comput. Struct. Biotechnol. J. 8 (2013) 1–8. doi:10.5936/csbj.201308004.

[16] B.E. Weiner, N. Woetzel, M. Karakaş, N. Alexander, J. Meiler, BCL::MP-fold: folding membrane proteins through assembly of transmembrane helices, Structure. (2013). doi:10.1016/j.str.2013.04.022.

[17] L.R. Forrest, Y.-W. Zhang, M.T. Jacobs, J. Gesmonde, L. Xie, B.H. Honig, G. Rudnick, Mechanism for alternating access in neurotransmitter transporters, Proc. Natl. Acad. Sci. 105 (2008) 10338–10343. doi:10.1073/pnas.0804659105.

[18] C. Fenollar-Ferrer, L.R. Forrest, Structural models of the NaPi-II sodium-phosphate cotransporters, Pflügers Arch. - Eur. J. Physiol. 471 (2019) 43–52. doi:10.1007/s00424-018-2197-x.

[19] N.H. Joh, T. Wang, M.P. Bhate, R. Acharya, Y. Wu, M. Grabe, M. Hong, G. Grigoryan, W.F. DeGrado, De novo design of a transmembrane Zn2+-transporting four-helix bundle, Science. 346 (2014) 1520–1524. doi:10.1126/science.1261172.

[20] S. Newstead, Symmetry and structure in the POT family of proton coupled peptide transporters, Symmetry (Basel). 9 (2017) 85. doi:10.3390/sym9060085.

[21] K. Khafizov, C. Perez, C. Koshy, M. Quick, K. Fendler, C. Ziegler, L.R. Forrest, Investigation of the sodium-binding sites in the sodium-coupled betaine transporter BetP, Proc. Natl. Acad. Sci. U. S. A. 109 (2012) E3035–E3044. doi:10.1073/pnas.1209039109.

[22] M. Rapp, E. Granseth, S. Seppälä, G. von Heijne, Identification and evolution of dual-topology membrane proteins, Nat. Struct. Mol. Biol. 13 (2006) 112–116. doi:10.1038/nsmb1057.

[23] M.H. Saier, Tracing pathways of transport protein evolution, Mol. Microbiol. 48 (2003) 1145–1156. doi:10.1046/j.1365-2958.2003.03499.x.

[24] M.W. Franklin, S. Nepomnyachyi, R. Feehan, N. Ben-Tal, R. Kolodny, J.S. Slusky, Evolutionary pathways of repeat protein topology in bacterial outer membrane proteins, Elife. 7 (2018) 1–19. doi:10.7554/eLife.40308.

[25] S. White, Membrane proteins of known 3D structure, (n.d.). http://blanco.biomol.uci.edu/mpstruc/ (accessed April 19, 2017).

[26] A.V. Kajava, Tandem repeats in proteins: from sequence to structure, J. Struct. Biol. 179 (2012) 279–288. doi:10.1016/j.jsb.2011.08.009.

[27] I. Walsh, F.G. Sirocco, G. Minervini, T. Di Domenico, C. Ferrari, S.C.E. Tosatto, RAPHAEL: recognition, periodicity and insertion assignment of solenoid protein structures, Bioinformatics. 28 (2012) 3257–3264. doi:10.1093/bioinformatics/bts550.

[28] P. Do Viet, D.B. Roche, A.V. Kajava, TAPO: a combined method for the identification of tandem repeats in protein structures, FEBS Lett. 589 (2015) 2611–2619. doi:10.1016/J.FEBSLET.2015.08.025.

[29] R. Sabarinathan, R. Basu, K. Sekar, ProSTRIP: a method to find similar structural repeats in three-dimensional protein structures, Comput. Biol. Chem. 34 (2010) 126–130. doi:10.1016/J.COMPBIOLCHEM.2010.03.006.

[30] T. Hrabe, A. Godzik, ConSole: using modularity of Contact maps to locate Solenoid domains in protein structures, BMC Bioinformatics. 15 (2014) 119. doi:10.1186/1471-2105-15-119.

[31] L. Hirsh, D. Piovesan, L. Paladin, S.C.E. Tosatto, Identification of repetitive units in protein structures with ReUPred, Amino Acids. 48 (2016) 1391–1400. doi:10.1007/s00726-016-2187-2.

[32] L. Hirsh, L. Paladin, D. Piovesan, S.C.E. Tosatto, RepeatsDB-lite: a web server for unit annotation of tandem repeat proteins, Nucleic Acids Res. 46 (2018) W402–W407. doi:10.1093/nar/gky360.

[33] R.G. Parra, R. Espada, I.E. Sánchez, M.J. Sippl, D.U. Ferreiro, Detecting repetitions and periodicities in proteins by tiling the structural space, J. Phys. Chem. B. 117 (2013) 12887–12897. doi:10.1021/jp402105j.

[34] K.B. Murray, W.R. Taylor, J.M. Thornton, Toward the detection and validation of repeats in protein structure, Proteins Struct. Funct. Genet. 57 (2004) 365–380. doi:10.1002/prot.20202.

[35] E.S.C. Shih, M.J. Hwang, Alternative alignments from comparison of protein structures, Proteins Struct. Funct. Genet. 56 (2004) 519–527. doi:10.1002/prot.20124.

[36] E.S.C. Shih, R.C.R. Gan, M.J. Hwang, OPAAS: A web server for optimal, permuted, and other alternative alignments of protein structures, Nucleic Acids Res. 34 (2006) 95–98. doi:10.1093/nar/gkl264.

[37] A.L. Abraham, E.P.C. Rocha, J. Pothier, Swelfe: A detector of internal repeats in sequences and structures, Bioinformatics. 24 (2008) 1536–1537. doi:10.1093/bioinformatics/btn234.

[38] H. Chen, Y. Huang, Y. Xiao, A simple method of identifying symmetric substructures of proteins, Comput. Biol. Chem. 33 (2009) 100–107. doi:10.1016/j.compbiolchem.2008.07.026.

[39] A. Guerler, C. Wang, E.W. Knapp, Symmetric structures in the universe of protein folds, J. Chem. Inf. Model. 49 (2009) 2147–2151. doi:10.1021/ci900185z.

[40] B. Chakrabarty, N. Parekh, PRIGSA: protein repeat identification by graph spectral analysis, J. Bioinform. Comput. Biol. 12 (2014) 1442009. doi:10.1142/S0219720014420098.

[41] C. Kim, J. Basner, B. Lee, Detecting internally symmetric protein structures, BMC Bioinformatics. 11 (2010) 303. doi:10.1186/1471-2105-11-303.

[42] S.E. Bliven, A. Lafita, P.W. Rose, G. Capitani, A. Prlić, P.E. Bourne, Analyzing the symmetrical arrangement of structural repeats in proteins with CE-Symm, PLOS Comput. Biol. 15 (2019) e1006842. doi:10.1371/journal.pcbi.1006842.

[43] V. Sojo, C. Dessimoz, A. Pomiankowski, N. Lane, Membrane proteins are dramatically less conserved than water-soluble proteins across the tree of life, Mol. Biol. Evol. 33 (2016) 2874–2884. doi:10.1093/molbev/msw164.

[44] H.M. Berman, The Protein Data Bank, Nucleic Acids Res. 28 (2000) 235–242. doi:10.1093/nar/28.1.235.

[45] P.W. Rose, A. Prlić, C. Bi, W.F. Bluhm, C.H. Christie, S. Dutta, R.K. Green, D.S. Goodsell, J.D. Westbrook, J. Woo, J. Young, C. Zardecki, H.M. Berman, P.E. Bourne, S.K. Burley, The RCSB Protein Data Bank: views of structural biology for basic and applied research and education, Nucleic Acids Res. 43 (2015) D345–D356. doi:10.1093/nar/gku1214.

[46] L. Paladin, L. Hirsh, D. Piovesan, M.A. Andrade-Navarro, A.V. Kajava, S.C.E. Tosatto, RepeatsDB 2.0: improved annotation, classification, search and visualization of repeat protein structures, Nucleic Acids Res. 45 (2017) D308–D312. doi:10.1093/nar/gkw1136.

[47] T.J.P. Hubbard, A.G. Murzin, S.E. Brenner, C. Chothia, SCOP: a structural classification of proteins database, Nucleic Acids Res. 25 (1997) 236–239. doi:10.1093/nar/25.1.236.

[48] C. Orengo, A. Michie, S. Jones, D. Jones, M. Swindells, J. Thornton, CATH – a hierarchic classification of protein domain structures, Structure. 5 (1997) 1093–1109. doi:10.1016/S0969-2126(97)00260-8.

[49] K. Shimizu, W. Cao, G. Saad, M. Shoji, T. Terada, Comparative analysis of membrane protein structure databases, Biochim. Biophys. Acta - Biomembr. (2018). doi:10.1016/j.bbamem.2018.01.005.

[50] S. Neumann, A. Fuchs, A. Mulkidjanian, D. Frishman, Current status of membrane protein structure classification, Proteins Struct. Funct. Bioinf. 78 (2010) 1760–1773. doi:10.1002/prot.22692.

[51] E.D. Levy, J.B. Pereira-Leal, C. Chothia, S.A. Teichmann, 3D complex: a structural classification of protein complexes, PLoS Comput. Biol. 2 (2006) 1395–1406. doi:10.1371/journal.pcbi.0020155.

[52] S. Pandit, J. Skolnick, Fr-TM-align: a new protein structural alignment method based on fragment alignments and the TM-score, BMC Bioinformatics. 9 (2008) 531. doi:10.1186/1471-2105-9-531.

[53] G. Pagès, E. Kinzina, S. Grudinin, Analytical symmetry detection in protein assemblies. I. Cyclic symmetries, J. Struct. Biol. 203 (2018) 142–148. doi:10.1016/j.jsb.2018.04.004.

[54] G. Pagès, S. Grudinin, Analytical symmetry detection in protein assemblies. II. Dihedral and cubic symmetries, J. Struct. Biol. 203 (2018) 185–194. doi:10.1016/j.jsb.2018.05.005.

[55] E. Karakas, H. Furukawa, Crystal structure of a heterotetrameric NMDA receptor ion channel, Science. 344 (2014) 992–997. doi:10.1126/science.1251915.

[56] R. Mancusso, G.G. Gregorio, Q. Liu, D.N. Wang, Structure and mechanism of a bacterial sodium-dependent dicarboxylate transporter, Nature. 491 (2012) 622–626. doi:10.1038/nature11542.

[57] C. Mulligan, C. Fenollar-Ferrer, G.A. Fitzgerald, A. Vergara-Jaque, D. Kaufmann, Y. Li, L.R. Forrest, J.A. Mindell, The bacterial dicarboxylate transporter VcINDY uses a two-domain elevator-type mechanism, Nat. Struct. Mol. Biol. 23 (2016) 256–263. doi:10.1038/nsmb.3166.

[58] M. Sokolova, N. Japkowicz, S. Szpakowicz, Beyond accuracy, F-score and ROC: a family of discriminant measures for performance evaluation, in: A. Sattar, B. Kang (Eds.), AI 2006 Adv. Artif. Intell., Springer Berlin Heidelberg, Berlin, Heidelberg, 2006: pp. 1015–1021.

[59] D.M.W. Powers, Evaluation: from precision, recall and F-measure to ROC, informedness, markedness & correlation, J. Mach. Learn. Technol. 2 (2011) 37–63. http://www.bioinfo.in/contents.php?id=51 (accessed August 23, 2019).

[60] A. Tharwat, Classification assessment methods, Appl. Comput. Informatics. (2018). doi:10.1016/j.aci.2018.08.003.


Figure Legends

Figure 1. Quaternary symmetry detection in MemSTATS: summary results and examples. (A) Confusion matrices summarizing the performance of each algorithm at detecting quaternary symmetries in the α-helical (blue) or β-barrel (gray) transmembrane proteins in the MemSTATS dataset. SymD results were obtained using one of the two author-recommended thresholds: z-TM-score = 8. TN = true negatives, FP = false positives, FN = false negatives, TP = true positives. (B-E) Case studies discussed in the text and especially relevant for AnAnaS (B), QuatSymm (C), CE-Symm (D) and SymD (E). (B) The membrane-embedded portion of Complex I (PDB: 4HEA), in which chains L (orange), M (blue) and H (yellow) are related by an open pseudo-symmetry not detectable by AnAnaS. (C) Chain L (orange) and chain M (blue) in the bacterial photosynthetic reaction center (PDB: 1PRC) are structurally pseudo-symmetric (Cα RMSD ~1.9 Å) around the shown axis (black) but have a very low sequence identity (~22%), which prevents QuatSymm from detecting the pseudo-symmetry. (D) The dimer of the proton-chloride antiporter ClC (PDB: 1OTS) is C2-symmetric, but each of its protomers also contains a C2 pseudo-symmetry. CE-Symm detects all of these symmetries but only insofar as they are hierarchically related to the smallest repeated subunits (shown in different colors). Because the internal pseudo-symmetries do not encompass all the residues in the protomers, CE-Symm does not report the full extent of the quaternary symmetry: the undetected portions are in gray. (E) Internal symmetry detection programs CE-Symm and SymD merge all chains in a complex according to the order in which they appear in the submitted coordinate file. If this order does not correspond to the structural proximity of the chains, the ability of these algorithms to detect even a perfect C4 symmetry, like the one found in the KcsA potassium channel (PDB: 1K4C), is hampered. For instance, in the example shown, SymD might be able to detect that the blue-colored subunits are related by a C4 symmetry transformation but would not find a match for the gray subunit.
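The chain-order issue in panel E can be illustrated with a small preprocessing sketch. The code below is a hypothetical workaround and is not part of CE-Symm or SymD: the function name `order_chains_by_proximity`, the use of chain centroids, and the greedy nearest-neighbour walk are all assumptions chosen for illustration. It reorders chains so that consecutive chains are spatial neighbours before they are merged into a single coordinate stream.

```python
import math

def centroid(coords):
    """Mean position of a list of (x, y, z) tuples."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))

def order_chains_by_proximity(chains):
    """Greedy reordering of chains (hypothetical helper, not a CE-Symm/SymD feature).

    chains: dict mapping chain ID -> list of (x, y, z) coordinates,
            in the order the chains appear in the coordinate file.
    Returns a list of chain IDs in which each chain is the nearest
    not-yet-visited neighbour (by centroid distance) of the previous one,
    starting from the first chain in the file.
    """
    ids = list(chains)
    centers = {cid: centroid(chains[cid]) for cid in ids}
    order = [ids[0]]
    remaining = set(ids[1:])
    while remaining:
        last = centers[order[-1]]
        # sorted() makes tie-breaking deterministic (alphabetical chain ID)
        nxt = min(sorted(remaining), key=lambda cid: math.dist(last, centers[cid]))
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Toy C4 arrangement: one point per chain on a square, listed in the file
# as A, C, B, D so that A is followed by the subunit opposite to it.
toy = {
    "A": [(10.0, 0.0, 0.0)],
    "C": [(-10.0, 0.0, 0.0)],
    "B": [(0.0, 10.0, 0.0)],
    "D": [(0.0, -10.0, 0.0)],
}
print(order_chains_by_proximity(toy))  # ['A', 'B', 'C', 'D']
```

A greedy walk is sufficient for ring-like assemblies such as the toy tetramer above; more irregular complexes may need a different ordering criterion.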


Figure 2. Internal symmetry detection in MemSTATS: summary results and examples. (A) Confusion matrices summarizing the performance of each algorithm at detecting internal symmetries in the α-helical (blue) or β-barrel (gray) transmembrane proteins in the MemSTATS dataset. SymD results were obtained using one of the two author-recommended thresholds: z-TM-score = 8. For β-barrel proteins, structural internal symmetry has poorly understood functional or evolutionary meaning and, hence, these cases were included in the benchmark only for completeness and are based on the number of strands. TN = true negatives, FP = false positives, FN = false negatives, TP = true positives. (B-C) Case studies discussed in the text and especially relevant for the internal symmetry detection methods. (B) Symmetries with independent axes in the same structure, shown for chain A of the glutamate transporter homolog GltPh (PDB: 2NWX). None of the discussed detection algorithms can simultaneously detect both axes. (C) Example of so-called slip symmetry: a trivial superposition of chain A from the complex of a connexin 26 gap junction channel (PDB: 2ZW3) (blue) onto its copy that has been shifted by 20 residues (orange) can receive a high structural alignment score. Such slip symmetries are a frequent source of false positives for SymD.

Figure 3. Comparison of the performance of symmetry detection algorithms on the symmetries in the α-helical transmembrane proteins in the MemSTATS dataset. Informedness, also known as Youden's index, is used as an unbiased metric of performance and is calculated as True Positive Rate (TPR) + True Negative Rate (TNR) – 1. For a method that performs at least as well as chance, informedness ranges from 0 to 1, with 1 representing perfect predictive power of the algorithm and 0 indicating an inability to distinguish between true and false predictions (negative values, down to –1, would indicate worse-than-chance predictions). Colors indicate the four methods: green for AnAnaS, orange for QuatSymm, blue for CE-Symm, and gray for SymD computed with a z-TM-score cutoff of 8. The lighter shade indicates results for which the symmetric order was detected correctly but only a fraction of the expected symmetry-related residues were recognized (with an allowed error of 20 consecutive residues). For SymD, changing the z-TM-score cutoff to 10 increases the informedness for quaternary symmetries from 0.01 to 0.03 and decreases it for internal symmetries from 0.35 to 0.27.
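As a worked illustration of the informedness formula in the Figure 3 legend, the following minimal sketch computes TPR + TNR – 1 from confusion-matrix counts. The function name is an assumption, and the counts are transcribed from what appear to be the SymD α-helical panels of Figures 1A and 2A (z-TM-score cutoff of 8), so they should be treated as illustrative rather than authoritative.

```python
def informedness(tp, fn, tn, fp):
    """Informedness (Youden's index) = TPR + TNR - 1.

    Assumes at least one actual positive and one actual negative,
    otherwise the rates are undefined.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one actual positive and one actual negative")
    tpr = tp / (tp + fn)  # true positive rate (sensitivity)
    tnr = tn / (tn + fp)  # true negative rate (specificity)
    return tpr + tnr - 1.0

# Counts as read from the extracted SymD panels (assumed alpha-helical set).
symd_quaternary = informedness(tp=50, fn=20, tn=3, fp=7)
symd_internal = informedness(tp=24, fn=24, tn=63, fp=11)

print(round(symd_quaternary, 2))  # 0.01
print(round(symd_internal, 2))    # 0.35
```

Under these assumptions the two values agree with the SymD informedness quoted in the legend for the cutoff of 8 (0.01 for quaternary and 0.35 for internal symmetries).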


[Figure 1 image. Panel A: confusion matrices (Actual vs. Predicted; TN, FP, FN, TP with row percentages and totals) for AnAnaS, QuatSymm, CE-Symm and SymD. Panels B-E: case studies labeled, in order, "Helical symmetry", "Low sequence identity", "Incomplete coverage" and "Chain order".]

[Figure 2 image. Panel A: confusion matrices (Actual vs. Predicted; TN, FP, FN, TP with row percentages and totals) for CE-Symm and SymD. Panels B-C: case studies labeled, in order, "Independent axes" and "Slip symmetry".]

[Figure 3 image. Bar chart of informedness (y-axis: TPR + TNR – 1, from 0.0 to 1.0). Quaternary symmetries: AnAnaS, QuatSymm, CE-Symm, SymD. Internal symmetries: CE-Symm, SymD.]

Highlights

• MemSTATS contains quaternary and internal symmetries in membrane proteins
• Symmetry descriptions include symmetry axis orientation relative to the membrane
• Four widely-used symmetry-detection methods exhibit complementary strengths
• Methods designed to detect quaternary symmetries are very reliable
• Internal symmetries are challenging, but MemSTATS points to areas for improvement