Patterns of Dynamics Comprise a Conserved Evolutionary Trait



Journal Pre-proof

Patterns of dynamics comprise a conserved evolutionary trait

F. Zsolyomi, V. Ambrus, M. Fuxreiter

PII: S0022-2836(19)30671-0
DOI: https://doi.org/10.1016/j.jmb.2019.11.007
Reference: YJMBI 66325
To appear in: Journal of Molecular Biology
Received Date: 11 July 2019
Revised Date: 4 November 2019
Accepted Date: 13 November 2019

Please cite this article as: F. Zsolyomi, V. Ambrus, M Fuxreiter, Patterns of dynamics comprise a conserved evolutionary trait, Journal of Molecular Biology, https://doi.org/10.1016/j.jmb.2019.11.007. This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Published by Elsevier Ltd.

Graphical abstract: Conservation of sequence (0.22) versus conservation of flexibility (0.76). Patterns of Dynamics (PODs) can be used to infer evolutionary relationships as an alternative to sequence or structure.

Patterns of dynamics comprise a conserved evolutionary trait

F. Zsolyomi1, V. Ambrus1, M. Fuxreiter1*

1 MTA-DE Laboratory of Protein Dynamics, Department of Biochemistry and Molecular Biology, University of Debrecen, Hungary

* Correspondence: [email protected]

Keywords: protein dynamics, evolutionary relationship, structure conservation, intrinsically disordered proteins, divergent sequences

Summary
The importance of protein dynamics in function may suggest an evolutionary selection on large-scale protein motions. Here we systematically studied the dynamic characteristics in 2221 protein domains (58477 sequences) of the Pfam database. We defined the patterns of dynamics (PODs) based on the estimated NMR order parameters and the predicted degree of disorder and found a significant correlation between them in families of both structured and disordered protein domains. We demonstrate that conservation of dynamic patterns frequently exceeds conservation of sequence and is comparable to the patterns of hydropathy and nonspecific interaction potential. Similarity of dynamic patterns is weakly correlated to structure similarity and to the degree of disorder. We illustrate that POD alignments could be applied to sequentially divergent or intrinsically disordered regions. We propose that patterns of dynamics comprise a conserved evolutionary trait, which could be used to infer evolutionary relationships as an alternative to sequence and structure.

Introduction
Protein motions are inherent to function [1, 2] and cover a wide range of timescales, from picosecond-nanosecond amino acid side-chain reorientations to millisecond-second domain movements [3-5]. Motions in enzymes [6], flexible proteins [7] or supramolecular assemblies [8] sample different regimes, and dynamics dominates the activity of conformationally heterogeneous [9] and intrinsically disordered (ID) [10] proteins. Extensive dynamics can also be retained in protein complexes [11, 12], resulting in fast conformational exchange in the bound state [12-14]. Owing to their importance in function, conservation of given collective motions has been observed in RNases [15], cyclophilin [16], and the PAS superfamily [17]. In Ras GTPases, for example, normal modes related to the conformational switch between the GDP- and GTP-bound forms have been maintained in evolution [18]. Conserved residue couplings [19, 20] were related to allosteric regulation [21, 22]. In addition, changes in dynamic modes were proposed to contribute to the evolution of new functions [23, 24]. Dynamical characteristics of proteins can be maintained for a variety of reasons [25-27], out of which we focus on the evolutionary aspects [28, 29]. The question is whether the connection between dynamics and function could serve as a basis of evolutionary relationships.

This problem is especially relevant in the case of intrinsically disordered (ID) proteins [10, 30], which exhibit high variability in sequence [31] and employ short, low-complexity motifs for interactions [27, 32]. This is illustrated by the ID linker in human replication protein A, which exhibits negligible sequence conservation despite the similar backbone flexibilities found in five species from three kingdoms of life [33]. In accord, flexibility patterns of 215 domains of the SCOP database [34] could reliably detect distant homologous proteins [35]. Along these lines, the c-Myc transcription factors show conserved signatures of predicted disorder, which could be employed to identify distant relationships [36].

In this work, we have performed a systematic analysis using different dynamics-related characteristics of protein domains in the Pfam database [37]. We defined the patterns of dynamics (PODs) based on the estimated N-H S2 order parameters [38] and the predicted degree of disorder [39]. Similarity of PODs was assessed using the Hidden Markov Model (HMM)-based alignments of protein domains [37], and was compared to that derived from patterns of hydropathy [40] and nonspecific interaction potential ('stickiness') [41] (Figure 1). We demonstrate that patterns of dynamics are highly conserved within protein domains in reference to alignments of corresponding randomized and scrambled sequences. Conservation of dynamic patterns frequently exceeds that of sequence, and thus could be used to infer evolutionary relationships in sequentially divergent protein families. Conservation of dynamic patterns weakly correlates to structure similarity or to the degree of disorder. We propose that alignment of PODs could be employed to analyse evolutionary relationships from structured to intrinsically disordered proteins, as exemplified by relating a functional amyloid from yeast to a human transcription factor.
Results
Patterns of dynamics are highly correlated in protein families
2221 Pfam domain families containing 58477 sequences in total were analysed (Methods). The Pfam-T set covered a wide range of properties in terms of length (7–1371 AA), sequence conservation (2.0–94.8%) and the degree of disorder (0–100%) (Table S1, Figures S1, S2). Patterns of flexibility were determined by the Dynamine algorithm [38], whereas patterns of disorder were predicted by the Espritz NMR method [39] using the sequences of full proteins containing domains in the Pfam-T dataset. Patterns of hydropathy were computed using a Kyte-Doolittle scale [40] and patterns of nonspecific interaction potential were obtained using the 'stickiness' scale [41]. Alignments of patterns were generated by replacing amino acids in the HMM alignments with the predicted Dynamine and Espritz scores, hydropathy and stickiness values (Figure 1, Methods). Similarity of the aligned patterns was characterized in a pairwise manner. The median of the corresponding Pearson's correlation coefficients (PCCs) was used to assess the conservation of the dynamics, disorder, hydropathy or stickiness patterns of the domain (Methods).

The patterns of hydropathy and stickiness exhibit a reasonable correlation of PCC = 0.52 using the Pfam-T domain sequences (Figure 2A, Table S2). The patterns of disorder exhibit a comparable correlation (PCC = 0.52), while the correlation between the patterns of flexibility, derived from the predicted NMR order parameters [38], is slightly higher (PCC = 0.56) (Figure 2B). The correlations between the domain patterns are only moderately affected by their flanking regions (Figure S3), indicating that pattern similarity is a local property. Similarity between dynamic patterns was not affected by the family populations (Table S3, Figure S4). To assess the statistical significance of these correlations, scrambled (Pfam-S) and randomized (Pfam-R) domain sequences were generated (Figure 1).
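The pattern alignment and the median pairwise PCC described above can be sketched as follows. This is a minimal illustration rather than the pipeline used in this work: it maps aligned sequences onto the Kyte-Doolittle scale [40] only, and the treatment of gaps (columns where either sequence has a gap are skipped) and the `min_columns` parameter are assumptions. The function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import median

# Kyte-Doolittle hydropathy scale [40]
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def to_pattern(aligned_seq, scale):
    """Replace each residue by its scale value; gaps and unknown symbols become None."""
    return [scale.get(aa) for aa in aligned_seq.upper()]

def pearson(x, y):
    """Pearson's correlation coefficient; None if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)

def pairwise_pcc(p, q, min_columns=3):
    """PCC over alignment columns where both patterns have a value (assumed gap handling)."""
    pairs = [(a, b) for a, b in zip(p, q) if a is not None and b is not None]
    if len(pairs) < min_columns:
        return None
    x, y = zip(*pairs)
    return pearson(x, y)

def family_conservation(aligned_seqs, scale=KD):
    """Median of all pairwise PCCs within one aligned family."""
    patterns = [to_pattern(s, scale) for s in aligned_seqs]
    pccs = [pairwise_pcc(p, q) for p, q in combinations(patterns, 2)]
    pccs = [r for r in pccs if r is not None]
    return median(pccs) if pccs else None

if __name__ == "__main__":
    # Toy alignment, not taken from Pfam
    print(family_conservation(["AILK-DEV", "AVLKGDEI", "GILR-EDV"]))
```

For flexibility or disorder patterns, the per-residue predictor scores would take the place of the scale lookup in `to_pattern`; the rest of the calculation is unchanged.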
In the Pfam-S dataset, the amino acid compositions were retained, while in the randomized set, frequencies of amino acids corresponded to the SwissProt database (Methods). Correlations between patterns of hydropathy and stickiness almost fully diminish on the scrambled and randomized sequences (Figure 2A, Table S2). In contrast, patterns of dynamics still exhibit a considerable similarity on the Pfam-S and Pfam-R datasets, reflecting a weaker dependence on sequence (Figure 2B, Table S2). The correlations between the patterns of flexibility, disorder, hydropathy and stickiness of Pfam-T sequences significantly exceed the correlations obtained on the alignments of scrambled and randomized sequences (p < 2.2 × 10^-16) (Figures 2A, 2B; Tables S2, S4). These results suggest that the local dynamic characteristics of protein domains, similarly to structure-related features, are conserved.

Sequentially divergent families exhibit conserved patterns of dynamics
We then probed the relationship between the conservation of sequence and the patterns of dynamics. We observed a considerable correlation between the similarity of domain patterns and sequence using the Pfam-T dataset (Figure 2C). Correlation coefficients are lower for the patterns of flexibility and disorder than those obtained for hydropathy and stickiness (Table S5). In the scrambled (Pfam-S) and randomized (Pfam-R) datasets, conservation of patterns of dynamics weakly correlates to sequence similarities, while such correlation almost fully diminishes in the case of hydropathy and stickiness patterns (Figure 2C, Table S5). These results suggest that dynamical properties of protein domains are less tightly coupled to sequence conservation than hydropathy and nonspecific interaction potential.
To further explore the relationship between the conservation of sequence and the similarity of domain patterns, we divided the domains in the Pfam-T set into four groups (Figure 2C): Q1: low pattern similarity – high sequence conservation; Q2: low pattern similarity – low sequence conservation; Q3: high pattern conservation – low sequence similarity; Q4: high pattern conservation – high sequence similarity. Categories 'high' and 'low' were defined based on the distributions of these properties in the Pfam-T and Pfam-R datasets (Tables S2, S6, Methods). Patterns of dynamics were significantly correlated in the majority of domains (> 95 %) in the Pfam-T dataset (Q3 – Q4 in Figure 2C, Table S7). Out of these cases, high sequence similarity (> 0.14) could be observed in 63.6 % of the domains (Q4, Figure 2C, Table S7). About 800 of the 2221 Pfam-T domains possess divergent sequences, yet exhibit a significant correlation between their patterns of dynamics, hydropathy or stickiness (Q3, Figure 2C, Table S7). The correlation between dynamic patterns in protein families where either sequence or dynamic pattern similarity is low (Q1, Q2, Q3 in Figure 2C) is significantly higher than that in scrambled or randomized sequences (Pfam-S, Pfam-R; Figure S5). Taken together, these results suggest that conservation of sequence may not be a prerequisite for conservation of dynamic characteristics of domains. In accord, functional divergence, which is possibly reflected by the family size, does not affect the correlation between dynamic properties (Table S3, Figure S4).

Conservation of structure and dynamic patterns are weakly correlated
We then analyzed to what extent the conservation of structure determines the conservation of dynamic patterns. Domains with known structures were collected into the Pfam-PDB dataset (810 domains, Figure S6, Table S8) and were superimposed using the CATH-SSAP algorithm [42] (Figure 1, Methods).
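The four-group division (Q1 to Q4) introduced above for sequence versus pattern similarity can be sketched as a simple threshold test. The sequence cutoff of 0.14 is the value quoted in the text; the pattern cutoff is derived in this work from the Pfam-T and Pfam-R distributions (Table S6) and is not reproduced here, so the value 0.2 used below is only a placeholder. The function name and the example families are hypothetical.

```python
from collections import Counter

def assign_quadrant(seq_similarity, pattern_pcc, pattern_cutoff, seq_cutoff=0.14):
    """Assign a domain family to Q1-Q4 following the definitions in the text.

    Q1: low pattern similarity, high sequence conservation
    Q2: low pattern similarity, low sequence conservation
    Q3: high pattern conservation, low sequence similarity
    Q4: high pattern conservation, high sequence similarity
    """
    high_seq = seq_similarity > seq_cutoff
    high_pattern = pattern_pcc > pattern_cutoff
    if high_pattern:
        return "Q4" if high_seq else "Q3"
    return "Q1" if high_seq else "Q2"

if __name__ == "__main__":
    # Made-up (sequence similarity, median PCC) values for illustration only
    families = {"famA": (0.45, 0.70), "famB": (0.06, 0.65),
                "famC": (0.30, 0.05), "famD": (0.04, 0.02)}
    placeholder_cutoff = 0.2  # assumption, not the value from Table S6
    labels = {name: assign_quadrant(s, r, placeholder_cutoff)
              for name, (s, r) in families.items()}
    print(labels)
    print(Counter(labels.values()))
```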
Structural comparison has been carried out in a pairwise manner within each protein family, and the corresponding normalized RMSD values [43] were averaged within each family (Methods). The domains in Pfam-PDB were divided into four groups based on the high or low similarity in structure and patterns of dynamics (Figure 3A, Methods). Conservation of domain patterns in most cases coincides with high similarity of structure in the Pfam-PDB (> 80 %, Figure 3A, Table S9). The correlation between the conservation of domain patterns and conservation of structure is weak (R < 0.3), although significant (p < 0.05, Figure 3A, Table S10). A similarly weak correlation has been observed using the TM-score [44] as a measure of structural similarity (R = 0.1, p = 0.005). This suggests that dynamical characteristics, similarly to hydropathy and stickiness, could serve as alternatives to structure to analyse evolutionary relationships.

Patterns of dynamics are more conserved in intrinsically disordered regions
A considerable fraction of domains (~ 10 %) exhibit large deviations in structure (backbone SIMAX > 5 Å [42]), while having conserved patterns of dynamics (Figure 3A, Table S9). This indicates that conservation of dynamic properties and structure might not be coupled. This does not depend on the likely functional divergence of the family (Figure S4), nor on the definition of the biological function (Figure S8). The functions of these domains could be linked to conformational changes, as in the case of Cre recombinase [45] (phage integrase; PF00589). The structure of these domains is often less tightly packed, or contains regions without a well-folded structure [46]. For example, viral domains are enriched in disordered regions [47] and their action is often linked to the plasticity of their interactions [48, 49]. However, domains with conserved dynamic characteristics and structure exhibit a similar degree of disorder as compared to domains with divergent structures (Figures 3A, S9). Vice versa, no significant difference in structural similarity (SIMAX) could be observed between domains with high and low propensity of predicted disordered residues (ID < 0.1 or ID ≥ 0.3 [50]) in the Pfam-PDB dataset (Figure 3B).
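The two disorder classes used here (ID < 0.1 and ID ≥ 0.3 [50]) can be illustrated with a short sketch that computes the fraction of disordered residues from per-residue scores. The per-residue cutoff of 0.5 is an assumption for illustration and is not the calibrated threshold of the Espritz predictor; the label "intermediate" for domains falling between the two classes is likewise a hypothetical name, since such domains are not assigned to either class in the text.

```python
def disorder_fraction(scores, residue_cutoff=0.5):
    """Fraction of residues whose disorder score is at or above the cutoff (assumed value)."""
    if not scores:
        raise ValueError("empty score list")
    return sum(1 for s in scores if s >= residue_cutoff) / len(scores)

def disorder_class(fraction):
    """Class boundaries as given in the text: structured ID < 0.1, disordered ID >= 0.3."""
    if fraction < 0.1:
        return "structured"
    if fraction >= 0.3:
        return "disordered"
    return "intermediate"  # hypothetical label, not used in the paper

if __name__ == "__main__":
    toy_scores = [0.05, 0.10, 0.08, 0.62, 0.71, 0.80, 0.12, 0.09, 0.04, 0.07]
    frac = disorder_fraction(toy_scores)
    print(frac, disorder_class(frac))
```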
We then performed a systematic analysis of the relationship between the conservation of domain patterns and protein disorder (Figures S9-S11), which was computed by Espritz NMR [39]. We divided the domains in the Pfam-PDB dataset based on the fraction of disordered residues (structured: ID < 0.1; disordered: ID ≥ 0.3 [50], Methods) and analysed the conservation of different patterns in these two classes. We have also compared these two classes to intrinsically disordered domains (IDDs), functions of which are linked to the absence of a well-defined structure [51]. Flexibility and disorder patterns exhibit a significantly higher conservation in disordered than in structured domains (Figure 3C), while a less pronounced difference is observed between the hydropathy and stickiness patterns (Figure 3C). POD conservation of disordered domains in Pfam is comparable to that of IDDs (Figure S12). In accord, the correlations between the degree of disorder and the PCC values of flexibility and disorder patterns exceed those obtained for hydropathy and stickiness (Figure S12). The correlations between the degree of disorder and conservation of dynamic patterns are weak, yet significant (Figure S10). These results suggest that patterns of dynamics could be conserved even in the absence of a well-defined structure, and might be evolutionary characteristics that indicate functional similarity of intrinsically disordered protein regions.

Patterns of dynamics can reveal evolutionary relationships
Domains with conserved patterns of dynamics exhibit a broad range of molecular functions (Figures S13 – S15, Methods). Although highly populated families appear to be functionally more divergent, especially with low structure similarity (Figure S14), these parameters do not have a considerable impact on POD conservation (Figures S4, S8). We specifically analysed domains where patterns of dynamics are more conserved than sequence (Q3 versus Q4 in Figure 2C) or structure (Q3 versus Q4 in Figure 3A). Domains with conserved PODs and divergent sequences are significantly enriched (p < 0.05) in catalytic (GO:0003824) and transporter (GO:0005215) activity and significantly depleted (p < 0.05) in nucleic acid (GO:0003677, GO:0003723), protein (GO:0005515) or ion binding (GO:0043167), enzyme regulatory (GO:0030234) or structural molecule (GO:0005198) activity as compared to domains with conserved sequences (Table S12). Domains where conservation of dynamic patterns exceeds structural similarity are enriched in catalytic activities (GO:0003824) and protein binding (GO:0005515) and depleted in enzyme regulatory (GO:0030234) activity (Table S13). Interestingly, domains in the different categories (Q3 and Q4 in Figures 2C, 3A) exhibit comparable degrees of disorder (Figures S9, S11).

These results imply that patterns of dynamics could be employed to infer evolutionary information for a variety of functional classes, as we illustrate using the LID domain of adenylate kinase (Pfam PF05191). The LID binds ATP and assists in transferring a phosphoryl group to nucleotide monophosphates. Upon substrate binding, the enzyme undergoes a large conformational change from the open to the closed state [52, 53] (Figure 4A). Although the LID domain exhibits a large variability in sequence (similarity = 0.22 for 13 organisms, 20 domains from bacteria to human), the patterns of flexibility by the Dynamine algorithm are strongly correlated (PCC = 0.76, Figure 4B), in accord with its similar structural properties (Simax = 0.785). This suggests that dynamic properties of the LID domain might be under positive selection to be able to mediate domain movements. Another illustrative example is the tRNA synthetase class I domain (PF00133), which is structurally highly divergent (Simax = 32.3), yet exhibits similar dynamic patterns (PCC for Dynamine 0.98). This family has low sequence similarity (5.5 %) as well as low degrees of structural disorder (13.6 %).
Discussion and conclusions
Evolutionary relationships of proteins can be inferred from sequence and/or structure. Sequence alignment methods are limited to higher degrees of similarity, whereas structure-based methods require proteins with a well-defined fold. Loosely packed or intrinsically disordered regions are functionally important [30], but their alignment is problematic by either method owing to low degrees of similarity and the lack of structure.

Patterns of dynamics (PODs) can be predicted from the primary sequence. PODs are dynamic "fingerprints", which provide a coarse description of local flexibility or actual disorder along the protein chain. The pattern of dynamics indicates local variations in mobility. For example, interaction motifs embedded in disordered regions usually have lower flexibility as compared to their flanking segments [54]. Protein regions that link globular domains also often have distinct dynamic characteristics [33]. The accuracy of POD predictions [38, 39] is close to that of secondary structure algorithms [55].

Systematic analysis of Pfam domains demonstrates that patterns of dynamics, similarly to structure-based patterns, are highly conserved (Figures 2A, 2B). Patterns of dynamics and sequence similarities are related, but PODs can also be conserved in the case of sequentially divergent protein families (Figure 2C). Similarity of dynamic properties may parallel that of structure, and can even exceed structural similarity (Figure 3A). Furthermore, conservation of dynamic patterns weakly correlates to the degree of disorder (Figure S10). Thus, patterns of flexibility and disorder could be applicable to elucidate evolutionary relationships in disordered proteins, the function of which stems from dynamics [36].

These principles can be illustrated by analysing the relationship between a yeast protein, Rlm1, and a human transcription factor, Mef2D, both of which undergo a non-pathological aggregation. Mef2D is implicated in muscle and neuronal development, whereas Rlm1 forms a functional amyloid in yeast [56]. Intrinsically disordered protein regions (272-586 region of Rlm1 and 101-365 region of Mef2D) were found to be critical for aggregation in both cases [56]; these regions exhibit low sequence similarity (30.4 %). The predicted patterns [38], however, could be aligned using an alignment method based on a Needleman-Wunsch algorithm, resulting in a PCC = 0.741 (p = 1.7 × 10^-42) (Figure 4C). Thus, patterns of dynamics indicate a possible evolutionary relationship between the amyloid-forming segments, which cannot be inferred from sequence or structure.

Taken together, our results suggest that PODs comprise a conserved protein trait that can be extracted from the primary sequence, and can be applied to evolutionary relationships even in the case of highly dynamic or divergent sequences. This opens new perspectives to analyse the evolution of intrinsically disordered protein regions or macromolecular assemblies undergoing significant conformational changes.
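A pattern alignment based on a Needleman-Wunsch algorithm, as mentioned above for Rlm1 and Mef2D, could look like the sketch below. The scoring scheme is an assumption made for illustration: a pair of positions scores `bonus - |a - b|`, and each gap costs `gap`. The scoring function, the gap penalty, and the function names are hypothetical and are not taken from the method used in this work. The PCC is then computed over the aligned (non-gap) columns.

```python
import math

def nw_align_profiles(p, q, gap=-0.5, bonus=0.5):
    """Global (Needleman-Wunsch) alignment of two numeric profiles.

    Returns the list of aligned value pairs (gapped positions are dropped).
    """
    n, m = len(p), len(q)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = score[i - 1][j - 1] + bonus - abs(p[i - 1] - q[j - 1])
            score[i][j] = max(match, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Traceback from the bottom-right corner, preferring matches on ties
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        match = score[i - 1][j - 1] + bonus - abs(p[i - 1] - q[j - 1])
        if score[i][j] == match:
            pairs.append((p[i - 1], q[j - 1]))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs

def pearson(pairs):
    """Pearson's correlation coefficient over aligned value pairs."""
    x, y = zip(*pairs)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

if __name__ == "__main__":
    # Toy flexibility profiles; the second carries one extra position
    a = [0.9, 0.8, 0.3, 0.2, 0.7]
    b = [0.9, 0.8, 0.5, 0.3, 0.2, 0.7]
    aligned = nw_align_profiles(a, b)
    print(len(aligned), pearson(aligned))
```

With real predictor output, the gap penalty and bonus would need tuning, because they determine how readily the alignment opens gaps instead of pairing dissimilar flexibility values.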

Acknowledgements
We thank Daniel Jarosz (Stanford University) for sharing the data on functional amyloid-forming regions in Mef2D and Rlm1. M.F. is also grateful to Prof Dan Tawfik for fruitful discussions in the earlier phase of the project. F.Zs. thanks Rita Pancsa for technical help. This work was supported by HAS 11015 and GINOP-2.3.2-15-201600044.

Author contributions F. Zs. assembled the Pfam-T, Pfam-S, Pfam-R datasets, established the computational pipelines, and performed the statistical analysis and prepared the tables and figures, V. A. assembled the Pfam-PDB dataset, and performed the SIMAX analysis, M.F. designed and supervised the work, evaluated the results and wrote the paper.

Declaration of Interests The authors declare no competing interests.


References
[1] Artymiuk PJ, Blake CC, Grace DE, Oatley SJ, Phillips DC, Sternberg MJ. Crystallographic studies of the dynamic properties of lysozyme. Nature. 1979;280:563-8.
[2] Frauenfelder H, Petsko GA, Tsernoglou D. Temperature-dependent X-ray diffraction as a probe of protein structural dynamics. Nature. 1979;280:558-63.
[3] Frauenfelder H, Sligar SG, Wolynes PG. The energy landscapes and motions of proteins. Science. 1991;254:1598-603.
[4] Henzler-Wildman K, Kern D. Dynamic personalities of proteins. Nature. 2007;450:964-72.
[5] Lewandowski JR, Halse ME, Blackledge M, Emsley L. Protein dynamics. Direct observation of hierarchical protein dynamics. Science. 2015;348:578-81.
[6] Henzler-Wildman KA, Lei M, Thai V, Kerns SJ, Karplus M, Kern D. A hierarchy of timescales in protein dynamics is linked to enzyme catalysis. Nature. 2007;450:913-6.
[7] Boehr DD, Nussinov R, Wright PE. The role of dynamic conformational ensembles in biomolecular recognition. Nat Chem Biol. 2009;5:789-96.
[8] Wu H, Fuxreiter M. The Structure and Dynamics of Higher-Order Assemblies: Amyloids, Signalosomes, and Granules. Cell. 2016;165:1055-66.
[9] Lindorff-Larsen K, Best RB, Depristo MA, Dobson CM, Vendruscolo M. Simultaneous determination of protein structure and dynamics. Nature. 2005;433:128-32.
[10] van der Lee R, Buljan M, Lang B, Weatheritt RJ, Daughdrill GW, Dunker AK, et al. Classification of intrinsically disordered regions and proteins. Chem Rev. 2014;114:6589-631.
[11] Mittag T, Orlicky S, Choy WY, Tang X, Lin H, Sicheri F, et al. Dynamic equilibrium engagement of a polyvalent ligand with a single-site receptor. Proc Natl Acad Sci U S A. 2008;105:17772-7.
[12] Rosenzweig R, Sekhar A, Nagesh J, Kay LE. Promiscuous binding by Hsp70 results in conformational heterogeneity and fuzzy chaperone-substrate ensembles. eLife. 2017;6.
[13] Tompa P, Fuxreiter M. Fuzzy complexes: polymorphism and structural disorder in protein-protein interactions. Trends Biochem Sci. 2008;33:2-8.
[14] Fuxreiter M. Fuzziness in Protein Interactions-A Historical Perspective. J Mol Biol. 2018;430:2278-87.
[15] Narayanan C, Bernard DN, Bafna K, Gagne D, Chennubhotla CS, Doucet N, et al. Conservation of Dynamics Associated with Biological Function in an Enzyme Superfamily. Structure. 2018;26:426-36 e3.
[16] Holliday MJ, Camilloni C, Armstrong GS, Vendruscolo M, Eisenmesser EZ. Networks of Dynamic Allostery Regulate Enzyme Function. Structure. 2017;25:276-86.
[17] Pandini A, Bonati L. Conservation and specialization in PAS domain dynamics. Protein Eng Des Sel. 2005;18:127-37.
[18] Raimondi F, Orozco M, Fanelli F. Deciphering the deformation modes associated with function retention and specialization in members of the Ras superfamily. Structure. 2010;18:402-14.
[19] Popovych N, Sun S, Ebright RH, Kalodimos CG. Dynamically driven protein allostery. Nat Struct Mol Biol. 2006;13:831-8.
[20] Tsai CJ, del Sol A, Nussinov R. Allostery: absence of a change in shape does not imply that allostery is not at play. J Mol Biol. 2008;378:1-11.
[21] Ma J, Karplus M. Ligand-induced conformational changes in ras p21: a normal mode and energy minimization analysis. J Mol Biol. 1997;274:114-31.
[22] Haliloglu T, Bahar I. Adaptability of protein structures to enable functional interactions and evolutionary implications. Curr Opin Struct Biol. 2015;35:17-23.


[23] Campbell E, Kaltenbach M, Correy GJ, Carr PD, Porebski BT, Livingstone EK, et al. The role of protein dynamics in the evolution of new enzyme function. Nat Chem Biol. 2016;12:944-50.
[24] Baier F, Hong N, Yang G, Pabis A, Miton CM, Barrozo A, et al. Cryptic genetic variation shapes the adaptive evolutionary potential of enzymes. eLife. 2019;8.
[25] Rychkova A, Mukherjee S, Bora RP, Warshel A. Simulating the pulling of stalled elongated peptide from the ribosome by the translocon. Proc Natl Acad Sci U S A. 2013;110:10195-200.
[26] Warshel A, Bora RP. Perspective: Defining and quantifying the role of dynamics in enzyme catalysis. J Chem Phys. 2016;144:180901.
[27] Smock RG, Gierasch LM. Sending signals dynamically. Science. 2009;324:198-203.
[28] Tokuriki N, Tawfik DS. Protein dynamism and evolvability. Science. 2009;324:203-7.
[29] Campbell EC, Correy GJ, Mabbitt PD, Buckle AM, Tokuriki N, Jackson CJ. Laboratory evolution of protein conformational dynamics. Curr Opin Struct Biol. 2018;50:49-57.
[30] Wright PE, Dyson HJ. Intrinsically disordered proteins in cellular signalling and regulation. Nat Rev Mol Cell Biol. 2015;16:18-29.
[31] Brown CJ, Johnson AK, Dunker AK, Daughdrill GW. Evolution and disorder. Curr Opin Struct Biol. 2011;21:441-6.
[32] Davey NE, Cyert MS, Moses AM. Short linear motifs - ex nihilo evolution of protein regulation. Cell Commun Signal. 2015;13:43.
[33] Daughdrill GW, Narayanaswami P, Gilmore SH, Belczyk A, Brown CJ. Dynamic behavior of an intrinsically unstructured linker domain is conserved in the face of negligible amino acid sequence conservation. J Mol Evol. 2007;65:277-88.
[34] Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 1995;247:536-40.
[35] Pandini A, Mauri G, Bordogna A, Bonati L. Detecting similarities among distant homologous proteins by comparison of domain flexibilities. Protein Eng Des Sel. 2007;20:285-99.
[36] Mahani A, Henriksson J, Wright AP. Origins of Myc proteins--using intrinsic protein disorder to trace distant relatives. PLoS One. 2013;8:e75057.
[37] Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res. 2016;44:D279-85.
[38] Cilia E, Pancsa R, Tompa P, Lenaerts T, Vranken WF. From protein sequence to dynamics and disorder with DynaMine. Nat Commun. 2013;4:2741.
[39] Walsh I, Martin AJ, Di Domenico T, Tosatto SC. ESpritz: accurate and fast prediction of protein disorder. Bioinformatics. 2012;28:503-9.
[40] Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. J Mol Biol. 1982;157:105-32.
[41] Levy ED, De S, Teichmann SA. Cellular crowding imposes global constraints on the chemistry and evolution of proteomes. Proc Natl Acad Sci U S A. 2012;109:20461-6.
[42] Cuff A, Redfern OC, Greene L, Sillitoe I, Lewis T, Dibley M, et al. The CATH hierarchy revisited-structural divergence in domain superfamilies and the continuity of fold space. Structure. 2009;17:1051-62.
[43] Kolodny R, Koehl P, Levitt M. Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures. J Mol Biol. 2005;346:1173-88.
[44] Zhang Y, Skolnick J. Scoring function for automated assessment of protein structure template quality. Proteins. 2004;57:702-10.
[45] Kwon HJ, Tirumalai R, Landy A, Ellenberger T. Flexibility in DNA recombination: structure of the lambda integrase catalytic core. Science. 1997;276:126-31.

[46] Tompa P, Fuxreiter M, Oldfield CJ, Simon I, Dunker AK, Uversky VN. Close encounters of the third kind: disordered domains and the interactions of proteins. Bioessays. 2009;31:328-35.
[47] Tokuriki N, Oldfield CJ, Uversky VN, Berezovsky IN, Tawfik DS. Do viral proteins possess unique biophysical features? Trends Biochem Sci. 2009;34:53-9.
[48] Hagai T, Azia A, Babu MM, Andino R. Use of host-like peptide motifs in viral proteins is a prevalent strategy in host-virus interactions. Cell Rep. 2014;7:1729-39.
[49] Duro N, Miskei M, Fuxreiter M. Fuzziness endows viral motif-mimicry. Mol Biosyst. 2015;11:2821-9.
[50] Gsponer J, Futschik ME, Teichmann SA, Babu MM. Tight regulation of unstructured proteins: from transcript synthesis to protein degradation. Science. 2008;322:1365-8.
[51] Zhou J, Oldfield CJ, Yan W, Shen B, Dunker AK. Intrinsically disordered domains: Sequence disorder function relationships. Protein Sci. 2019;28:1652-63.
[52] Muller CW, Schulz GE. Structure of the complex between adenylate kinase from Escherichia coli and the inhibitor Ap5A refined at 1.9 A resolution. A model for a catalytic transition state. J Mol Biol. 1992;224:159-77.
[53] Muller CW, Schlauderer GJ, Reinstein J, Schulz GE. Adenylate kinase motions during catalysis: an energetic counterweight balancing substrate binding. Structure. 1996;4:147-56.
[54] Fuxreiter M, Tompa P, Simon I. Local structural disorder imparts plasticity on linear motifs. Bioinformatics. 2007;23:950-6.
[55] Garnier J, Gibrat J-F, Robson B. GOR secondary structure prediction method version IV. Methods Enzymol. 1996. p. 540-53.
[56] Chakrabortee S, Byers JS, Jones S, Garcia DM, Bhullar B, Chang A, et al. Intrinsically Disordered Proteins Drive Emergence and Inheritance of Biological Traits. Cell. 2016;167:369-81 e12.
[57] Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792-7.
[58] Taylor WR, Orengo CA. Protein structure alignment. J Mol Biol. 1989;208:1-22.


Figure legends

Figure 1 Flow-chart of the analysis. Patterns of flexibility, disorder, hydropathy and stickiness of Pfam domains were computed using sequences derived from Swissprot (Pfam-T). As a reference, patterns were also determined for scrambled (Pfam-S) and randomized (Pfam-R) domain sequences (detailed in Methods). Patterns were aligned by replacing amino acids by the computed scores or descriptor values in the HMM alignment of Pfam [37]. In cases where domain structures were available in PDB (Pfam-PDB), we analysed the relationship between structure and pattern similarities.

Figure 2 Similarity of hydropathy, stickiness, flexibility and disorder patterns and their correlations to sequence conservation. A Median Pearson's correlation coefficients for hydropathy and stickiness in Pfam-T (T), Pfam-S (S), Pfam-R (R) datasets (see also Table S2). B Median Pearson's correlation coefficients for flexibility and disorder in Pfam-T (T), Pfam-S (S), Pfam-R (R) datasets (see also Table S2). C Correlation between sequence and pattern similarity in Pfam-T (blue), Pfam-S (light gray), Pfam-R (dark gray) datasets for hydropathy, stickiness, flexibility (Dynamine) and disorder (Espritz). Q1: high sequence similarity – low pattern conservation; Q2: low sequence similarity – low pattern conservation; Q3: low sequence similarity – high pattern conservation; Q4: high sequence similarity – high pattern conservation.

Figure 3 The relationship between similarities of structure and patterns of hydropathy, stickiness, flexibility and disorder. A Correlation between structure and pattern similarity in the Pfam-PDB dataset. Q1: high structure similarity – low pattern similarity; Q2: low structure similarity – low pattern similarity; Q3: low structure similarity – high pattern similarity; Q4: high structure similarity – high pattern similarity. B Structure similarity in domains with different degrees of disorder (ID ≤ 10 % light blue, ID > 30 % dark blue) in the Pfam-PDB dataset. Median Simax values are shown below the graph. C Median Pearson's correlation coefficients of domains with different degrees of disorder (ID ≤ 10 % light blue, ID > 30 % dark blue, IDD [51] teal) in the Pfam-PDB dataset.

Figure 4 Alignment of patterns of dynamics (PODs) may indicate an evolutionary relationship. A The LID domain of adenylate kinase exhibits two distinct conformations: a closed (PDB:1ake [52], cyan) and an open (PDB:4ake [53], marine) state with respect to the catalytic center (wheat, orange). The substrate is lime; the backbones are shown in light gray (closed state) and dark gray (open state). B Patterns of flexibility of the LID domains from 13 organisms could be aligned with PCC = 0.764, indicating a significant conservation of the dynamic properties. Sequences represented: KAD2 Bos taurus (144-179 aa), Drosophila melanogaster (145-180 aa), Homo sapiens (142-177 aa), Mus musculus (142-177 aa), Neurospora crassa (168-203 aa), Schizosaccharomyces pombe (131-166 aa), Saccharomyces cerevisiae (134-169 aa), Arabidopsis thaliana (161-196 aa); KAD3 Arabidopsis thaliana (161-196 aa), Bos taurus (128-163 aa), Homo sapiens (128-163 aa), Mus musculus (128-163 aa), Saccharomyces cerevisiae (145-180 aa); KAD4 Arabidopsis thaliana (160-195 aa), Homo sapiens (126-161 aa), Mus musculus (126-161 aa); KAD Bacillus subtilis (127-162 aa), Escherichia coli (123-158 aa), Francisella tularensis (123-158 aa), Photobacterium profundum (123-158 aa), Vibrio cholerae (123-158 aa). C Alignment of flexibility patterns of the 252-586 segment of Rlm1 (black) and the 101-365 segment of human Mef2D (red). For details, see Methods.


Methods

1. Datasets

Pfam-T (true) dataset. HMM full alignments of the Pfam A dataset were downloaded from http://pfam.xfam.org/ [37] (Pfam 31.0, release 2017.10.29; nfamily = 16479, nprotein = 31051470), excluding families with “Unknown Function” (nDUF-family = 3531). Sequences were cross-checked against reviewed UniProtKB/Swiss-Prot sequences (http://uniprot.org; UniProt release 2017.10.29). Sequences without evidence at protein level and transmembrane proteins (GO:0016021) were excluded. Domains with ambiguous assignments were also excluded (Table S13). Only protein families with more than 10 members were used, and the number of proteins per family was limited to 60; for larger families, 60 sequences were randomly selected. The Pfam-T dataset contained 2221 families and 58477 proteins (26.3 sequences/family) (Figure S1, Table S1).

Pfam-S (scrambled) dataset. The domain sequence in each protein of the Pfam-T set was randomly permuted (i.e. “scrambled”), while the rest of the sequence remained unchanged. Scrambling was carried out using the random library of Python and was replicated 10 times/sequence. Dynamics-related parameters were computed for each sequence variant and were averaged. The Pfam-S set represented a case of conserved amino acid composition without actual sequence conservation. Both the Pfam-S and Pfam-R (below) datasets kept the original residue positions in the HMM alignment [37] (Figure 1).

Pfam-R (random) dataset. The domain sequence in each protein of the Pfam-T set was randomly mutated to reproduce naturally occurring amino acid frequencies (Table S15). Randomization was performed using the random Python library and was repeated 10 times/sequence. Results were averaged over the 10 replicates (Figure 1).

Pfam-PDB (structure) dataset. All available PDB files containing the domain sequences in the Pfam-T dataset were collected. The PDB sequences belonging to the same protein were aligned using the MUSCLE algorithm [57].
The highest-resolution PDB structure with the longest sequence of the domain was selected. Only PDB structures with a resolution better than 2.5 Å, in which at least 70% of the domain sequence was represented, were included. The Pfam-PDB dataset contained 810 families and 6975 proteins (3-60 structures/family) (Figures 1, S6; Table S8).

2. Pattern analysis

Calculation of dynamic patterns. We derived the patterns of dynamics (PODs) from the predictions of Dynamine, which estimates backbone flexibility based on N-H S² order parameters using experimental data from the BioMagResBank [38], and Espritz NMR [39], which predicts the degree of disorder in the best agreement with experimental data. A pattern of dynamics was defined as the vector of predicted scores computed for the full protein sequence, with one value per amino acid. PODs were compared to patterns of two descriptors: hydropathy [40] and nonspecific interaction potential ('stickiness') [41]. The thresholds for the different predictors/characteristics are summarized in Table S16. Correlations between these properties on the Pfam-T dataset are shown in Table S17.

Generation of POD alignments. Patterns of dynamics by Dynamine [38] and Espritz NMR [39], as well as patterns of hydropathy [40] and nonspecific interactions [41], were computed for each protein sequence in the Pfam-T, Pfam-S and Pfam-R datasets. The scores/values, which were determined on a residue basis, were inserted into the Pfam HMM alignments [37], while all gap positions were kept fixed. Conservation and family-based patterns were thus assessed using the columns of the original HMM alignments.

3. Assessment of conservation

Sequence conservation. Sequence conservation was computed in amino acid groups (Table S18) at each position of the HMM alignment:

$$\text{Sequence conservation} = \frac{1}{N}\sum_{i}\frac{n_f}{n_s} \qquad (1)$$

where i is the position in the Pfam HMM alignment, $n_f$ is the number of most frequently occurring amino acids, which belong to the same group; $n_s$ is the number of non-gap positions in the same position of the HMM alignment, and N is the total number of columns in the HMM alignment (including all gap columns).

Structure conservation. Structure similarity was defined using normalized RMSD values [43]:

$$\mathrm{RMSD}_{norm} = \frac{\mathrm{RMSD} \times L_{max}}{N_{res}} \qquad (2)$$

where RMSD is the root-mean-square deviation of the residues, $L_{max}$ is the length of the longest sequence and $N_{res}$ is the total number of aligned residues. Domain structures in the Pfam-PDB dataset were superimposed using the CATH-SSAP algorithm [58]. Structural comparison was carried out in a pairwise manner within each protein family, and the corresponding normalized RMSD values were averaged in each family. The protocol was validated using the CATH S35 dataset [42] (Figure S5).

Pattern conservation. Pattern similarity was quantified by the Pearson's correlation coefficient (PCC):

$$\mathrm{PCC} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y} \qquad (3)$$

where cov(X,Y) is the covariance of variables X and Y, and $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y, respectively. Significant PCC values were defined as correlations with a p-value ≤ 0.05. PCC values did not follow a normal distribution, hence the median of the values was used to quantify pattern conservation.

4. Analysis of disorder

To analyse the preference for intrinsic disorder, the Espritz NMR program was used [39]. Residues were classified as ordered or disordered using the default threshold of 0.3089. Protein disorder was evaluated based on the fraction of disordered residues, and the limit of > 30% was applied for ID proteins, as previously suggested [50]. Ordered proteins were defined as those with a fraction of ID residues < 10%.
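The residue-level and protein-level disorder criteria can be illustrated with a short Python sketch. It assumes that per-residue Espritz NMR scores are already available as a list of floats (running the predictor is not shown) and that a residue counts as disordered when its score is at or above the 0.3089 threshold. The function name `classify_disorder`, the "intermediate" label and the example scores are hypothetical and serve only as an illustration.

```python
# Thresholds quoted in the text; the ">=" comparison is an assumption.
ESPRITZ_NMR_THRESHOLD = 0.3089
ID_FRACTION = 0.30       # > 30% disordered residues: ID protein
ORDERED_FRACTION = 0.10  # < 10% disordered residues: ordered protein


def classify_disorder(scores):
    """Return (fraction of disordered residues, class label) for one sequence."""
    if not scores:
        raise ValueError("empty score list")
    n_disordered = sum(1 for s in scores if s >= ESPRITZ_NMR_THRESHOLD)
    fraction = n_disordered / len(scores)
    if fraction > ID_FRACTION:
        label = "ID"
    elif fraction < ORDERED_FRACTION:
        label = "ordered"
    else:
        label = "intermediate"  # hypothetical label, not named in the text
    return fraction, label


# Made-up scores for a 10-residue segment (not real predictor output).
example = [0.05, 0.10, 0.20, 0.35, 0.40, 0.50, 0.60, 0.12, 0.08, 0.02]
fraction, label = classify_disorder(example)
print(f"{fraction:.2f} {label}")
```

In this made-up example four of the ten residues exceed the threshold, so the segment has a disordered fraction of 0.40 and falls into the ID class.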


5. Alignment of patterns of dynamics

Patterns of dynamics were aligned using a novel method, which is based on the Needleman-Wunsch algorithm. The full protocol will be published and will be available upon request.
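Because the protocol itself is not described here, the following Python sketch only illustrates how a generic Needleman-Wunsch global alignment can be applied to two numeric patterns, and how the Pearson's correlation coefficient can then be computed over the aligned, non-gap positions. The scoring scheme (a fixed bonus minus the absolute score difference), the linear gap penalty, the parameter values and all function names are assumptions made for illustration; this is not the authors' method.

```python
import math


def needleman_wunsch(x, y, gap=-0.2, bonus=0.5):
    """Globally align two numeric patterns.

    Returns a list of index pairs (i, j) aligning x[i] with y[j];
    None marks a gap. Similarity of two values is bonus - |a - b|
    (an assumed scoring scheme).
    """
    n, m = len(x), len(y)
    # score[i][j]: best score for aligning x[:i] with y[:j]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sim = bonus - abs(x[i - 1] - y[j - 1])
            score[i][j] = max(score[i - 1][j - 1] + sim,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback from the bottom-right corner to the origin
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sim = bonus - abs(x[i - 1] - y[j - 1])
            if score[i][j] == score[i - 1][j - 1] + sim:
                pairs.append((i - 1, j - 1))
                i, j = i - 1, j - 1
                continue
        if i > 0 and (j == 0 or score[i][j] == score[i - 1][j] + gap):
            pairs.append((i - 1, None))  # gap in y
            i -= 1
        else:
            pairs.append((None, j - 1))  # gap in x
            j -= 1
    return pairs[::-1]


def pearson(a, b):
    """Pearson's correlation coefficient of two equal-length lists.

    Raises ZeroDivisionError if either list is constant.
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def aligned_pcc(x, y, **kwargs):
    """PCC over the positions that the alignment pairs without gaps."""
    pairs = [(i, j) for i, j in needleman_wunsch(x, y, **kwargs)
             if i is not None and j is not None]
    return pearson([x[i] for i, _ in pairs], [y[j] for _, j in pairs])


# Made-up flexibility patterns: y is x without its first residue.
x = [0.9, 0.8, 0.3, 0.2, 0.7, 0.9]
y = [0.8, 0.3, 0.2, 0.7, 0.9]
print(needleman_wunsch(x, y))
print(aligned_pcc(x, y))
```

In this toy case the alignment leaves the first value of `x` unpaired and matches the remaining five positions one to one, so the PCC over the aligned positions is 1 up to floating-point rounding. Real patterns would require tuning of the gap penalty and scoring scheme.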


Figure 1 [Flow-chart image not reproduced. Panel titles: SEQUENCE ANALYSIS and STRUCTURE ANALYSIS. Sequence analysis: Pfam A and SwissProt sequences give the Pfam-T sequences and alignments; scrambling and randomization of the domain give the Pfam-S and Pfam-R sequences; individual patterns of dynamics (Dynamine, Espritz) and structure-related patterns (hydropathy, stickiness) give the pattern alignments for Pfam-T, Pfam-S and Pfam-R. Structure analysis: PDB sequences and superposition of domain structures give the structure alignment and the pattern alignment for Pfam-PDB.]

Figure 2 [Image not reproduced. Panels A, B and C; the plots in panel C are divided into quadrants Q1-Q4.]

Figure 3 [Image not reproduced. Panels A, B and C; the plots in panel A are divided into quadrants Q1-Q4.]

Figure 4 [Image not reproduced. Panels A, B and C; axis labels: Flexibility, AA. Panel B annotation: pattern correlation 0.764, sequence similarity 0.218.]

Highlights
• Coarse-grained dynamic characteristics comprise a conserved trait in Pfam domains
• Divergent sequences exhibit conserved flexibility and disorder patterns
• Similarities of structure and dynamic patterns are weakly correlated
• Patterns of dynamics can indicate evolutionary relationship in disordered proteins