Compact Structure Patterns in Proteins


Bhadrachalam Chitturi, Shuoyong Shi, Lisa N. Kinch, Nick V. Grishin

PII: S0022-2836(16)30289-3
DOI: 10.1016/j.jmb.2016.07.022
Reference: YJMBI 65160

To appear in: Journal of Molecular Biology

Received date: 24 July 2016
Accepted date: 29 July 2016

Please cite this article as: Chitturi, B., Shi, S., Kinch, L.N. & Grishin, N.V., Compact Structure Patterns in Proteins, Journal of Molecular Biology (2016), doi: 10.1016/j.jmb.2016.07.022

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT

Compact Structure Patterns in Proteins

Bhadrachalam Chitturi1,2,4‡, Shuoyong Shi2‡, Lisa N. Kinch3, Nick V. Grishin2,3*

1 Department of Computer Science and Engineering, Amrita School of Engineering, Amritapuri, Amrita Vishwa Vidyapeetham, Amrita University, India

2 Departments of Biophysics and Biochemistry, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX 75390-9050, USA

3 Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX 75390-9050, USA

4 Departments of Computer Science, University of Texas at Dallas, Richardson, TX 75083, USA

‡ These authors contributed equally to this work

* Correspondence to: Nick V. Grishin, Email: [email protected]

ABSTRACT


Globular proteins typically fold into tightly packed arrays of regular secondary structures. We developed a model to approximate compact parallel and antiparallel arrangements of α-helices and β-strands, enumerated all possible topologies formed by up to five secondary structural elements (SSEs), searched for their occurrence in spatial structures of proteins and documented their frequencies of occurrence in the PDB. The enumeration model grows larger supersecondary structure patterns (SSPs) by combining pairs of smaller patterns, a process which approximates a potential path of protein fold evolution. The most prevalent SSPs are typically present in superfolds such as the Rossmann-like fold, the ferredoxin-like fold and the Greek key motif, whereas the less frequent SSPs often possess split β-sheets, left-handed connections and crossing loops. In the PDB, we identified the previously undiscovered pretzel β-sheet and spiral β-sheet. This novel model allows us to discover theoretically possible SSPs that are absent in the PDB. All SSPs with up to four SSEs occurred in proteins. However, among SSPs with five SSEs, approximately 20% (218) are absent from existing folds. Of these unobserved SSPs, 71% (155) have unpaired β-strand(s), whereas 29% (63) lack unpaired β-strand(s); among the latter, 49 SSPs have two or more unfavorable features and 14 SSPs have one unfavorable feature. These 14 SSPs are of interest because many other SSPs with a single unfavorable feature are seen in the PDB. To facilitate future efforts in protein structure classification, engineering and design, we provide the resulting patterns and their frequency of occurrence in proteins at: http://prodata.swmed.edu/ssps/.


Keywords: Secondary structure elements, super-secondary structure pattern, helix, strand, fold, superfamily, protein 3D structures, protein data bank (PDB)

INTRODUCTION


The topological connectivity and arrangement of secondary structure elements (SSEs) in three-dimensional space define protein folds. Identifying and enumerating common substructure motifs within folds, such as the helix-turn-helix [1], the βαβ [2], and the Greek key [3, 4], has aided in predicting protein structure [5-9] and function [1, 10, 11] as well as in understanding fold evolution [12, 13]. These named substructures represent different super-secondary structure patterns (SSPs) that encompass two or more closely packed SSEs. SSPs can be defined by the order, connection topology, orientation, and packing of SSEs. The knowledge of SSP composition within folds helps us understand the structural and evolutionary relationships among proteins.

Previous studies have revealed a number of frequently recurring SSPs within diverse folds, such as the Rossmann fold [14], the β-grasp [15], and the Greek key [3, 16], and have established the value of using these common SSPs to outline structural relationships among large families. Databases such as CATH [17] and SCOP [18] have provided large-scale classification of protein structures according to these relationships. For instance, SCOP describes folds by conserved combinations of SSEs in the common structure core.

The SSPs that occur with high frequency confirmed some basic rules of protein folding. For example, an investigation of crossover connections in β-sheets highlighted a strong preference for right-handedness in βαβ units [19], while enumeration of β-sheet structures detected the absence of sheets with order 3142 or 2413, known as the 'pretzels' [20]. Distributions of open β-sheets have suggested a preference for a lower number of β-strand pairs adjacent in sequence but separated in the β-sheet, i.e., 'jumps' [6]. Recently, the ability of these rules to dictate the probability of SSP occurrence in two-layer architectures (i.e. consisting of 2 planes) was evaluated, helping explain the limited number of SSE arrangements seen in protein structures [21].
These rules can also aid protein engineering and design. For example, fundamental rules such as chirality and orientation preference of SSPs, along with additional rules such as the angle between SSEs and loop length, were used to successfully guide the design of ideal proteins [22].

We denote the set of all SSPs composed of n SSEs as Sn; for instance, S3 consists of all SSPs with three SSEs. Numerous small domains or protein fragments, corresponding to SSPs consisting of two (S2) or three (S3) SSEs, are present in the PDB, and the knowledge of their local interactions guides structure prediction. One popular hypothesis suggests that stable SSPs serve as folding nuclei [23-26]. Accordingly, correctly recognizing such core SSPs may help ab initio structure prediction. Hidden Markov models (HMMs) for predicting SSP three-dimensional context have been used to aid local structure prediction when a template is not available [27]. In CASP5, the FRAGFOLD server obtained the most accurate models for two new fold targets by assembling SSPs, using a simulated annealing algorithm [9]. SSP classification has also helped loop modeling. A library of small SSPs consisting of two SSEs linked by a loop, called SMotif [28], has been used to reduce the loop search space by selecting both candidate loop fragments that match loop length and 'bracing SSEs' that bound the loop and meet geometrical requirements [29]. Recently, SMotif [28] and chemical shift information were combined to model larger structures [8].

Several theoretical models have been proposed to enumerate SSPs in protein folds. Because the very large number of possible SSPs makes complete enumeration difficult, many of these models restrict their scope to various protein fold subsets. Early models describe α-helices packing onto β-sheets in a small subset of α/β folds [30], β-strand orientations in packed β-sheets [31, 32], and α-helical arrangements in globular proteins [33]. Enumeration of β-strand arrangements in open β-sheets is widely studied [3, 6, 20, 34]. For instance, a systematic analysis of topology preference for four-stranded β-sheet patterns found that 42 out of 96 possible topologies were identified in protein structures, and 50% of these structures were covered by only four topologies [34]. SSPs in β-sandwich structures have also been thoroughly investigated [35-40]. A comprehensive survey of Greek key motifs among β-barrels and β-sandwiches suggested basic rules that reflect their topological constraints and preferences [35]. Recently, models describing β-strand arrangements in β-sandwich structures have identified a characteristic feature among existing structures, termed 'interlock', and used it as a rule to distinguish and predict β-sandwiches [38-40]. Using such rules drawn from the analysis of recurring SSPs in proteins, Efimov proposed a method that models fold growth through stepwise addition of one SSE to a root structure pattern [41].
With this method, Efimov outlined possible folding pathways for five protein superfamilies of diverse folds.

We propose a more general theoretical model of fold growth by generating all possible up-and-down, compact SSPs built on a hexagonal lattice. Here, up-and-down refers to antiparallel orientation of the successive (as dictated by the sequence) SSEs in an SSP. Compactness broadly requires that we form tight clusters without holes in the middle of the SSP or concavities along the contour (periphery); in particular, when an SSE is added to an SSP (with at least two SSEs) to obtain a larger SSP, the added SSE must be adjacent in the lattice to at least two of the SSEs of the SSP. Within the definition of compactness, additional rules for combining two SSPs that both consist of at least two SSEs are detailed in the Appendix, along with some additional exceptions.

Our model extends Efimov's idea of structural tree construction [41], in which a new SSP is built by adding one additional SSE to a root SSP: instead of growing structures only by the addition of a single SSE, we treat larger SSPs as the combination of two smaller ones. Moreover, Efimov's root SSP was predefined with certain common patterns. For example, all-β structure enumeration was limited to SSPs containing a specific root comprised of only β-strands, and α/β structure enumeration was confined to SSPs containing a βαβ unit. Efimov also used certain strict rules to guide the SSE addition so that the resulting SSP was much more likely to occur in the protein database. Compared with Efimov's work, our enumeration initiates from elementary SSEs (i.e. β-strand and α-helix) and grows without preference for handedness or connection type. Thus, our SSPs are more comprehensive and enable identification of rare and unobserved SSPs in proteins. This idea of growing a larger SSP by combining smaller ones is a likely path of protein origin in nature [12, 50, 51].

Our model builds upon basic root SSPs (helix-helix, strand-strand, helix-strand, and strand-helix), considering rules such as compactness, to generate SSPs consisting of up to five SSEs that are arranged in up to three layers. Each of the resulting SSPs, represented as a two-dimensional matrix, was used to query existing structures in the PDB, which are also represented as two-dimensional matrices. We used ProSMoS [47] for this task, a method that allows for rapid identification [40, 41]. The majority of structures in the PDB can be detected by at least one SSP (Table 1). For each Sn, we assessed the percentage of SSEs in each structure matched by any SSP in Sn. We refer to this percentage as the SSE coverage. 72% of structures have at least half of their SSEs matched by SSPs in S5, and all SSPs in S2-S4 occur at least once in the PDB. To identify overrepresented and underrepresented SSPs, we compared the observed and expected frequency of SSPs (the latter obtained by a theoretical model). Overrepresented SSPs tend to occur within superfolds, which is in agreement with previous findings [13, 42-46], while underrepresented (and unobserved) SSPs tend to display one or more unfavorable structure features. The observed SSP frequency of occurrence is available at: http://prodata.swmed.edu/ssps/.


MATERIALS AND METHODS


Our method is composed of three steps: 1) SSP generation, 2) computation of the frequency of occurrence of every SSP in the PDB and 3) analysis that includes comparison of the observed and the expected frequency. Step 1 is executed by our model; Step 2 is executed with the help of PALSSE [48] and ProSMoS [47], where PALSSE decomposes a 3D protein structure into the constituent SSEs and ProSMoS establishes the interactions between the SSEs delineated by PALSSE.

We modeled topologies upon a hexagonal lattice, where a generic SSE, i.e. either a β-strand or an α-helix, was represented as a directed node on a point of the planar hexagonal lattice (Fig. 1). We call the obtained topologies skeletons. We retained only compact skeletons. Assignment of SSE type to each skeleton node and specification of interactions between every pair of SSEs yielded the set of SSPs.

ProSMoS [47] was employed to identify all occurrences of each SSP (a 2D matrix) in the PDB (a set of 2D matrices). The exact match of the SSP is a motif hit. Depending on the SSEs of the protein that match an SSP, a motif hit is assigned to one of the SCOP superfamilies (within the first 7 classes). Each superfamily assignment was called a superfamily hit. One SSP can have at most one superfamily hit per superfamily.

A simple statistical model was developed to obtain the expected frequency of SSP occurrences, which corresponds to the number of ways that an SSP can be generated by assembling smaller SSPs and the corresponding frequencies of the smaller SSPs. We compared the observed and the expected frequencies to reveal overrepresented and underrepresented SSPs. The details of these procedures are outlined below.

Generating SSPs with a lattice

We generated SSPs on a hexagonal lattice, where each lattice point on an XY plane represented a place-holder for a generic SSE, i.e. a node (which is directed), that was oriented up or down along the Z axis (Fig. 1). Nodes were assembled into skeletons by linking numerically sequential SSEs in an anti-parallel fashion, where links model loops. SSEs on adjacent nodes of a given skeleton interact with one another. Adjacency is determined by the underlying lattice.

In our model of SSP generation, we combine two smaller skeletons to obtain a larger one. Concatenation of two single nodes forms a two-node skeleton (represented by red arrows in Fig. 1a). Addition of a single node to any of the allowed lattice positions surrounding the two-node skeleton (criteria outlined below) forms a three-node skeleton, for example the right-handed skeleton shown in Fig. 1b. A four-node skeleton is assembled either by combining a three-node skeleton with a single-node skeleton or by combining a pair of two-node skeletons. A similar routine is implemented for assembly of five-node skeletons. During the assembly process, we exclude 1) nodes with connections that cross the plane (example in Fig. 1d) and 2) nodes that result in non-compact skeletons (example in Fig. 1e).

Details of compactness are elaborated in the Appendix. In brief, compactness was determined by three criteria. First, any lattice point that is not part of the skeleton can have at most three adjacent points in the skeleton. For example, given a four-node skeleton with two nodes in one layer and the remaining two in the adjacent layer, the shape of the skeleton can only be a rhombus (compact) but not a trapezoid (non-compact). In the trapezoid-shaped skeleton shown in Appendix Fig. S2c(ii), the central lattice point on the longer parallel side of the trapezoid, which is not a part of the skeleton, interacts with four nodes of the skeleton; thus, this skeleton is not compact.
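The first criterion can be illustrated with a minimal sketch. It assumes that every lattice point has six neighbours, which is consistent with the rhombus and trapezoid example, and it uses axial integer coordinates; the coordinate system, the function names and the two example skeletons are illustrative choices, not taken from the authors' implementation. Only the first of the three criteria is checked.

```python
# Assumption: each lattice point has six neighbours, addressed here
# with axial coordinates (q, r). Names below are hypothetical.
NEIGHBOR_OFFSETS = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1)]


def neighbors(point):
    """Return the six lattice points adjacent to `point`."""
    q, r = point
    return [(q + dq, r + dr) for dq, dr in NEIGHBOR_OFFSETS]


def max_outside_contacts(skeleton):
    """Largest number of skeleton nodes adjacent to any empty lattice point."""
    occupied = set(skeleton)
    # Only empty points that touch the skeleton can have contacts at all.
    outside = {n for p in occupied for n in neighbors(p)} - occupied
    return max(sum(n in occupied for n in neighbors(q)) for q in outside)


def passes_first_criterion(skeleton):
    """First compactness criterion: no empty point touches more than three nodes."""
    return max_outside_contacts(skeleton) <= 3


# Two nodes in one layer (r = 0) and two in the adjacent layer (r = 1).
rhombus = [(0, 0), (1, 0), (0, 1), (1, 1)]
trapezoid = [(0, 0), (2, 0), (0, 1), (1, 1)]  # empty point (1, 0) touches all four

print(passes_first_criterion(rhombus))    # True
print(passes_first_criterion(trapezoid))  # False
```

In the trapezoid, the empty point in the middle of the longer side is adjacent to all four nodes, which is the situation the text describes as non-compact.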
Second, when a new node is added to an existing skeleton, it should either extend an existing set of two or more collinear points (in the special case of β-strands) or be placed at a lattice point with at least two adjacent points in the existing skeleton (skeletons that violate this rule are shown in Appendix Fig. S2c(iii-iv)). Third, based on the number of SSEs, we eliminate certain arrangements of the skeleton points. For example, for five SSEs, even though both Appendix Figures S2d(i-ii) satisfy the above two conditions, we reject S2d(ii) in favor of S2d(i).

Given an allowed skeleton, we assign SSE types (β-strand or α-helix) to nodes and specify interaction types for each SSE pair (hydrogen bond, other bond, or no interaction) to obtain SSPs. For the three-node skeleton in Fig. 1, one of the ten possible right-handed SSPs is illustrated in panel c, which contains three β-strands with the 1st β-strand hydrogen-bonded to the 3rd β-strand. In summary, from the lattice we obtain a skeleton, which yields one or more corresponding SSPs. A combination of two smaller skeletons yields a larger skeleton. The workflow of the SSP generation is shown in Appendix Fig. S1.

Meta-matrix PDB database representation, SSP search and filtering

We constructed a database with ProSMoS [47, 49] to represent the PDB (Dec 16, 2011; 75196 PDB structures) as meta-matrices (matrices with additional information, i.e. interaction information, SSE length, etc.) of interacting SSEs, where SSEs were specified by PALSSE [48]. Each SSP was represented as a 2D matrix; we refer to these as query matrices because we search for them. The query matrices specified handedness, minimal SSE length (e.g. at least five residues for a β-strand and eight residues for an α-helix), and interactions. A motif hit is a sub-matrix of a PDB chain that exactly matches the query matrix. For a single SSP, multiple motif hits were permitted per PDB chain, allowing overlap of elements. By design, two different SSPs will not have an identical motif hit. Note that the SSEs in a motif hit are adjacent in 3D but are not necessarily adjacent in the protein sequence. Thus, the connection between SSEs that are consecutively numbered in the SSP query matrix can span any number of intervening SSEs of any length. ProSMoS [47] identifies hits with low stringency; the results were filtered so that the motif hits resembled the theoretical SSP lattice more closely. The details of the filtering steps are given in Appendix Fig. S4.
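The notion of a motif hit as an exactly matching sub-matrix can be sketched with a brute-force search. This is a simplified illustration and not the ProSMoS algorithm: the function name, the dictionary representation of the matrices and the labels "hbond", "other" and "none" are assumptions made here, and handedness, minimal SSE length and the filtering steps are left out.

```python
from itertools import combinations


def find_motif_hits(sse_types, interactions, query_types, query_interactions):
    """Return tuples of protein SSE indices that exactly match the query pattern.

    sse_types / query_types: strings of 'H' and 'E' in sequence order.
    interactions / query_interactions: dicts mapping an index pair (i, j),
    i < j, to "hbond" or "other"; a missing pair means "none".
    """
    n = len(query_types)
    hits = []
    # combinations() keeps sequence order but allows intervening SSEs.
    for idx in combinations(range(len(sse_types)), n):
        if any(sse_types[i] != t for i, t in zip(idx, query_types)):
            continue
        if all(
            interactions.get((idx[a], idx[b]), "none")
            == query_interactions.get((a, b), "none")
            for a in range(n)
            for b in range(a + 1, n)
        ):
            hits.append(idx)
    return hits


# Hypothetical chain: strand, strand, helix, strand.
protein_types = "EEHE"
protein_interactions = {
    (0, 1): "hbond",
    (1, 3): "hbond",
    (1, 2): "other",
    (2, 3): "other",
}

# Query: three strands, consecutive strands hydrogen-bonded, first and last not in contact.
meander_types = "EEE"
meander_interactions = {(0, 1): "hbond", (1, 2): "hbond"}

hits = find_motif_hits(protein_types, protein_interactions, meander_types, meander_interactions)
print(hits)  # [(0, 1, 3)]
```

The single hit skips the helix at index 2, which mirrors the point that the SSEs of a motif hit need not be adjacent in the protein sequence.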


Calculation of observed frequencies


We sought to limit the definition of superfamily to the SSEs present in the conserved structure core of the superfamily, in the context of observed superfamily frequency. To achieve this, we only counted superfamily hits if the query SSP was present in at least 50% of proteins of the superfamily. We call such counted superfamilies "core superfamilies". The same criterion is used for the domain hits. For a given SSP in Sn, the observed superfamily frequency was calculated by dividing the number of superfamily hits of the given SSP by the total number of counted superfamilies found by all SSPs in Sn. In order to compare motif frequency with superfamily frequency fairly, the motif frequency for an SSP in Sn was calculated by dividing the number of SSP motif hits in the counted superfamilies by the total number of motif hits in the counted superfamilies for all SSPs in Sn.

Calculation of expected frequencies from SSP decomposition

For a given SSP with n SSEs, we split it into all possible n−1 size combinations of two smaller SSPs, i.e. n can be split into any of the combinations { {1,n−1}, {2,n−2}, …, {n−1,1} }. The expected frequency of a larger SSP C was obtained from the two constituent smaller SSPs A and B as:

$E(C) = \sum_{i=1}^{n-1} \left[ f(A \in S_i) \times f(B \in S_x) \right]$

where x=n−i and f denotes the frequency assigned to a smaller SSP. We tried two expected frequency measures, which differ in the frequency value of a smaller SSP.

The first approach, which does not rely on observed frequencies, is termed theoretical expected frequency. Here, since S1 is a set of indivisible SSPs, the expected frequencies of both an α-helix and a β-strand are 1/2. Similarly, using the expected frequency of an α-helix and a β-strand, we obtained a frequency of 1/4 for each SSP in S2, i.e. helix-strand, strand-helix, helix-helix and strand-strand.

In the second approach, which we call observation-based expected frequency, the frequency of the smaller SSP is substituted by the actual observed frequency in the set of structures defined by the larger SSP. For example, when calculating S2 expected frequencies, the observed β-strand and α-helix frequencies in all S2 SSP motif hits are used. The observed frequencies of β-strand and α-helix are different, which resulted in unequal expected frequencies for SSPs in S2.

In both approaches, where applicable, the expected frequency was further divided. For S2 SSPs with two β-strands, we took into account the presence or absence of hydrogen bonding, splitting the expected frequency evenly between ββ with hydrogen bonds and ββ without hydrogen bonds. For larger SSPs, we took into account the impact of mirror symmetry; the resulting expected frequency was split evenly for a pair of SSPs with opposite handedness, such as a right-handed βαβ unit and a left-handed βαβ unit. Finally, each individual expected frequency was normalized by the sum of all expected frequencies.
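For S2 the decomposition has a single split, {1,1}, so the expected frequency reduces to the product of the two SSE frequencies. The sketch below works this case through under that reading of the text; the function name and the dictionary keys are hypothetical, and exact fractions are used so that the theoretical values can be compared directly.

```python
from fractions import Fraction
from itertools import product


def s2_expected(helix_freq):
    """Expected frequencies of the five S2 SSPs from the alpha-helix frequency.

    Assumes the beta-strand frequency is 1 - helix_freq, that each S2 value is
    the product of its two SSE frequencies, and that EE is split evenly between
    the hydrogen-bonded and non-hydrogen-bonded forms.
    """
    h = Fraction(helix_freq)
    sse = {"H": h, "E": 1 - h}
    raw = {a + b: sse[a] * sse[b] for a, b in product("HE", repeat=2)}
    ee = raw.pop("EE")
    raw["EE_hbond"] = ee / 2
    raw["EE_no_hbond"] = ee / 2
    total = sum(raw.values())  # normalization step; equals 1 for S2
    return {name: value / total for name, value in raw.items()}


theoretical = s2_expected(Fraction(1, 2))
for name, value in theoretical.items():
    print(name, value)
```

With a helix frequency of 1/2 this gives 1/4 for HH, HE and EH and 1/8 for each of the two EE forms. Passing a different helix frequency gives the unequal values of the observation-based variant.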


RESULTS


We generated all possible compact SSPs containing up to five SSEs. The number of SSPs in a given Sn exhibits exponential growth (Table 1): five in S2 and 1239 in S5. Unlike previous studies where enumeration was limited to a certain subset of protein folds [30, 31, 33-35, 39], our SSPs admit more SSE arrangements. Structural classification (CATH [17] and SCOP [18]) can be viewed as a summarization of SSPs present in nature, where the SSP relationships are established by structure comparison. However, studying SSPs in the opposite way, i.e. enumerating possible SSPs first and then examining their distribution in proteins, can provide additional insight into the protein folding rules that govern SSP occurrence and thereby limit the protein fold space. In particular, we can identify unobserved SSPs in protein structures. Each generated SSP in S1-S4 has at least one motif hit in the PDB. However, 228 SSPs of S5 (18%) are absent from the PDB (see Unobserved SSPs section).

For each set of SSPs, we counted the number of proteins identified by at least one SSP. Table 1 shows that the generated SSPs cover a significant portion of the PDB (99.5%, 97.0%, 92.9%, 87.7% for S2-S5). However, a small percentage of proteins cannot be found by any SSP. For example, 2198 proteins that contain at least 3 SSEs were not covered by any S3 SSP (see Limitations section). A similar computation was performed on all SCOP domains in the first 7 classes to quantify the SSP presence in SCOP superfamilies. The percentages of identified SCOP superfamilies (97.5%, 90.6%, 83.5%, 73.1% for S2-S5) are lower than the corresponding percentages of identified SCOP domains (99.0%, 95.2%, 90.0% and 83.9% for S2-S5). This discrepancy is due to the following reasons. First, only core superfamilies (defined in Methods) were considered when calculating the percentage of identified SCOP superfamilies.
However, if we restrict the domain hits to the superfamily core, the percentage of identified domains decreases marginally to 98.9%, 93.5%, 85.5% and 72.2% for S2-S5. The remaining discrepancy is attributed to the uneven distribution of SCOP superfamily sizes. The superfamily counts are less redundant than their representative domains.

For each Sn, we computed the SSE coverage for proteins in the PDB, where the coverage is defined as the number of non-redundant SSEs matched by any SSP of a given Sn, divided by the total number of SSEs of the structure. Fig. 2 highlights the coverage distributions. For S2, a majority of SSEs in the PDB structures are covered (80% of the structures have more than 90% of their SSEs covered); very few structures are uncovered. In fact, 99% of structures have at least half of their SSEs matched by S2 SSPs. For larger SSPs (S3-S5), the coverage is lower. For example, the percentage of all structures with at least half of their SSEs covered by some SSP in S5 (Table 1) is only 72%.

In an SSP, H and E [47] are used to denote an α-helix and a β-strand respectively, e.g. EHE denotes the βαβ SSP of S3. Sometimes we also denote an SSP simply by its identity, e.g. SSP 13. The number of SSEs in a given SSP is evident from the context. We use this notation in the sequel.

Overview of observed and expected frequency

Observed frequencies based on motif hits and superfamily hits are computed. The aim of the superfamily frequency is a) to eliminate structure database bias due to the presence of multiple entries of highly similar proteins [52], and b) to counter the occurrence of SSP repeats within PDB structures. For instance, a TIM barrel has seven right-handed βαβ motif hits, with consecutive motif hits overlapping by one β-strand. The observed superfamily frequencies should better reflect fold space.

Expected SSP frequency is determined by two factors: the number of combinations in which various smaller SSPs can assemble and the frequency of the smaller SSPs. SSPs that are assembled from more combinations of abundant smaller SSPs have a higher expected frequency, and vice versa. The expected frequency approximates the idea that larger protein folds represent an assembly of reusable smaller SSPs in protein evolution [13, 24]. Consider βαβ, an S3 SSP: both the left-handed βαβ and the right-handed βαβ are expected to have frequencies based on the frequencies of their respective constituent SSPs, but the fact that one is left-handed and the other is right-handed is not taken into account. Our expected frequency model is thus a simple one, in which the contribution of specific features of the larger SSP, including handedness, strand order of β-sheets and topological connections, is not taken into account. Consequently, the observed frequency correlates only moderately with the expected frequency (correlation coefficients vary from 0.3 to 0.7). Nevertheless, the expected SSP frequency provides a rough baseline for comparison with the observed SSP frequency, and it provides an opportunity to study the impact of handedness preference, β-sheet strand order and topology connections in structures.

Frequencies of α-helix and β-strand

Our theoretical expected frequency assumes an equal frequency of 1/2 for both α-helices and β-strands. Note that as the members of S1, i.e.
the α-helix and the β-strand, are indivisible, there is no associated observation-based expected frequency. However, the observed frequency is 43.5% for an α-helix and 56.5% for a β-strand. In terms of assembling larger folding units from stable smaller ones, perhaps the theoretical frequencies of the elements should rather reflect those of the simplest stable supersecondary structure units [2, 4, 53]: the α-hairpin, β-hairpin, and βαβ unit. Remarkably, theoretical frequencies (see Methods) for the α-helix (3/7=42.9%) and β-strand (4/7=57.1%) calculated from these simple stable SSPs agree with those observed in the database with an error of just 0.6%.

When considering the observed frequency of elements within superfamilies, the results show an opposite trend of 55.8% for α-helix and 44.2% for β-strand. This trend might reflect the relative difficulty of classifying α-helical structures with respect to the other main classes. As opposed to structures containing β-strands, whose interactions are more rigidly dictated by backbone hydrogen bonds, α-helical interactions tend to vary by their contacts, tilt angles and packing arrangements. To accommodate these fine-grained structure features, SCOP splits all α-helical folds and contains more superfamilies in the all α class (507 superfamilies) than in the all β class (354 superfamilies). However, normalizing the α-helix and β-strand counts by the frequency of SCOP superfamilies among the main classes (all α, all β, α/β and α+β) only slightly alters the frequencies of α-helix (53.8%) and β-strand (46.2%). Thus, the abundance of superfamilies in the all α class alone does not explain the apparent overabundance of α-helices calculated at the superfamily level.

Table 2 documents the ratio of α-helices and β-strands detected in superfamilies from each of the main SCOP classes. The ratios for α-helices present in the all α, α/β and α+β classes and for β-strands found in the all β, α/β and α+β classes are all at or near identity (ranging from 0.99 to 1). The ratio of superfamilies belonging to the class all-α that contain β-strands is 0.13. In contrast, the ratio of superfamilies belonging to the class all-β that contain α-helices is significantly higher at 0.62, yielding an explanation for the observed superfamily frequencies of α-helix and β-strand. This might suggest that it is relatively easy to add α-helices to existing all-β structure cores as opposed to adding β-strands to existing all-α structure cores.

Analysis of S2 hits

The expected S2 frequencies based on superfamily hits and motif hits are summarized in Table 3. The theoretical expected frequencies yield identical values for each of the SSPs, with the exception of the split of EE based on the presence of a hydrogen bond, while the observation-based expected frequencies vary according to the component SSEs. For example, the expected frequency of HH is 16.2% while the expected frequency for HE is 24.1%. In order to compare these two methods, we measured the difference between the expected frequency and the observed frequency for each method (Table 3, * marks the expected calculation closer to the observed). The expected motif frequencies obtained by the observation-based expected frequency are closer to the observed motif frequency for three (out of five) SSPs (EE, HH and HE). Similarly, the observed superfamily frequencies of three SSPs (EE, EH and HE) more closely match the observation-based expected frequency. Thus, employing observed frequencies of the constituent SSPs in the expected frequency enhances accuracy. The trend is more pronounced in larger SSPs. Therefore, in our analysis of S2-S5, we will consider only the observation-based expected frequency.

S2 consists of five SSPs: EH, HE, HH, hydrogen-bonded EE and non-hydrogen-bonded EE (frequencies in Table 3). The three simplest units of supersecondary structure analyzed by others include the β-hairpin, α-hairpin and βαβ units [2, 4, 13, 53]. Two of these stable units are included in the S2 SSPs, i.e. EE with hydrogen bonds (β-hairpin and the β-strands of βαβ) and HH (α-hairpin), while EH and HE alternate in the βαβ unit. The SSP EE without hydrogen bonds represents a special case that does not exist as an independent unit (each β-strand requires hydrogen-bonding to another β-strand not in the SSP), but can be used in assembling larger SSPs. It is observed much less commonly than the others (Table 3).

EH and HE have identical expected frequencies. However, the observed motif frequency of EH (25.04%) is higher than that of HE (22.49%). Causal factors are the prevalence of the classic Rossmann fold (doubly-wound), which usually starts with a β-strand and ends with an α-helix, and of the TIM barrel, where EH and HE overlap with each other. For instance, a typical P-loop containing nucleoside triphosphate hydrolase has 5 EH units and 4 HE units. However, both HE and EH have an observed superfamily frequency (20.56%) that is lower than that of α-hairpins and β-hairpins. Potentially, the discrepancy is caused by the clustering of Rossmann-like proteins into large SCOP superfamilies. For instance, a single superfamily of P-loop containing nucleoside triphosphate hydrolases contains 2433 domains with 26104 EH motif hits. This classification is reflected in the SCOP superfamily counts (Table 4): SCOP has fewer superfamilies in the α/β class (244 superfamilies) than in the all-β class (354 superfamilies).

EE displays the highest observed superfamily frequency (24.7%), and the second highest observed motif frequency (22.7%) among all S2 SSPs. These frequencies are higher than the expected values of 14.6% and 17.8%, respectively. EE is in fact a combination of EE with and without hydrogen bonds, thus its observed motif frequency is higher.
Further, most of the nonbonded EE motif hits correspond to nearby β-strand pairs in two different sheets of β-sandwich structures. This also leads to higher than expected observed superfamily frequencies. When compared to the parallel β-sheet configuration established by the EH and HE motifs, the antiparallel β-hairpin represented by EE is the energetically more favored β-sheet configuration due to the well-aligned hydrogen bonds3. The HH (α-helical hairpin) motif shows the second lowest observed motif frequency (15.7%), which is close to its expected motif frequency of 16.2% (lowest). This correlates with the lower number of α-helix hairpin repeating units present in α-helical proteins. The superfamily with the largest number of motif hits found by HH is the Globin-like13, but it contains only 3 pairs of α-helical hairpins that are perpendicular to each other and packed in a folded leaf topology. Alternatively, the observed superfamily frequency of HH ranks the second highest (23.66%) among S2 SSPs (higher than that of EH and HE), which is predicted by its expected superfamily frequency (second highest, 21.07%). The higher superfamily frequency correlates with the prevalence of superfamilies belonging to the all-α class (507) and a very high percentage (62%) of the superfamilies belonging to the all-β class contain α-helices.
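
The EH/HE imbalance noted above for a P-loop NTPase-like domain (5 EH units versus 4 HE units) can be reproduced with a minimal counting sketch. It assumes, as a simplification, that an S2 hit is any pair of sequence-adjacent SSEs and ignores the spatial contact criteria of the actual search; the SSE string and the function name `count_adjacent_pairs` are hypothetical and introduced only for illustration.

```python
from collections import Counter

def count_adjacent_pairs(sse_string):
    """Count every pair of sequence-adjacent SSEs, allowing overlaps.

    sse_string uses 'E' for a beta-strand and 'H' for an alpha-helix.
    Simplification: spatial contacts and hydrogen bonds are not checked.
    """
    return Counter(a + b for a, b in zip(sse_string, sse_string[1:]))

# Hypothetical SSE string: five strands alternating with five helices,
# starting with a strand and ending with a helix.
p_loop_like = "EHEHEHEHEH"
pairs = count_adjacent_pairs(p_loop_like)
print("EH:", pairs["EH"], "HE:", pairs["HE"])
```

Because the chain starts with a strand and ends with a helix, every helix closes an EH pair but only four helices open an HE pair, which is the source of the asymmetry in this toy case.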

Analysis of S3 hits

Superfamily frequency reflects fold space better

In S3, the 3-stranded β-meander (Fig. 3, SSP 1) has the highest observed motif frequency, which parallels its expected frequency. The expected frequency is higher than other SSPs because 1) the 3-stranded β-meander is mirror-symmetric to itself, so the procedure of dividing the expected frequency between mirrors (mirror symmetric SSPs) does not apply; 2) the 3-stranded βmeander is formed in two ways, with each combining a β-strand and a β hairpin; and 3) the observed frequency of β-hairpin is the highest among S2 SSPs. The 3-stranded β-meander is very often in the form of overlapping repeats in larger β-sheets. For instance, in a Porin protein composed of 14-22 β-strands (e.g. pdb 2por), 13-21 repeats of a 3-stranded β-meander can be found with the edge β-strand of one repeat overlapping the edge β-strand of the adjacent repeat. Accordingly, we observed 1878 3-stranded β-meander SSP hits distributed in 101 domains of the Porin superfamily. The high observed motif frequency is a consequence of repetitive occurrences of the query SSP. For a similar reason, the right-handed βαβ (Fig. 3, SSP 13) and the righthanded αβα (Fig. 3, SSP 19) also have high observed motif frequencies, since they represent overlapping repeats in α/β proteins, especially in Rossmann fold-like proteins. The ratio of observed motif frequency between the right-handed βαβ and the right-handed αβα is 1.6 (14.4% vs. 9.2%). This ratio coincides with that (5/3) estimated from a typical Rossmann-fold protein with a 6-stranded β-sheet flanked by three helices on each side, in which we can identify 5 righthanded βαβ units and 3 right-handed αβα units. The right-handed 3-helical bundle (Fig. 3, SSP 23) also displays high observed motif frequency. We observed that 23% of the motif hits occur in right-handed superhelical proteins with three α-helices per turn such as ARM repeat (e.g. pdb 1b3u chain B). In summary, motif hit frequency tends to feature the SSPs that are ideal building blocks of repeats. 
The SCOP superfamilies covered by each SSP are shown in Fig. 3, right panel. We noted that the 3-stranded β-meander, the right-handed βαβ and the 3-helical bundle remain on top in observed superfamily frequency. Compared to the ratio of motif frequencies (14.4/9.2), the observed superfamily frequency ratio of right-handed βαβ to left-handed αβα (6/4.8) is smaller, since the redundancy of motif hits in Rossmann-fold like proteins is removed. The observed superfamily frequency of 3-helical bundles is larger than the observed motif frequency, implying a diverse distribution of 3-helical bundles in SCOP. We noticed that the number of superfamilies with 3-helical bundles (392 superfamily hits) is slightly larger than the number of superfamilies with right-handed βαβ units (384 superfamily hits). This is consistent with the fact that a 3-helical bundle can exist as a standalone domain, whereas a right-handed βαβ is usually dependent. Eight superfamilies use a 3-helical bundle as their complete structure core, including Methane monooxygenase hydrolase (e.g. pdb 1xvb, chain F) and Duffy binding domain-like (e.g. pdb 1zrl, chain A). SSPs 10, 11, 16 and 17 (Fig. 3) represent all the S3 SSPs that can be obtained by combining a β-hairpin with an α-helix. These SSPs are not overrepresented when evaluated by their motif frequency (Fig. 3). However, their observed superfamily frequency is higher. These
SSPs are usually present as subsets of larger α+β folds. In particular, SSPs 10, 11 and 16 match different elements of a ferredoxin-like fold (α+β sandwich with antiparallel β-sheet in strand order 4132, βαββαβ, βαβ×2). For example, SSP 16 matches the first α-helix and the following βhairpin. The ferredoxin-like fold is a superfold present in many α+β sandwich domains with diverse molecular functions45; 54. In addition to the ferredoxin-like fold defined by SCOP (59 superfamilies), SSPs 10, 11 and 16 are also present in various circular permutations55 of the ferredoxin-like fold, such as Protein kinase-like (pdb1uca chain A), MutS N-terminal domainlike (pdb1oh8 chain A:2-116) and DCoH-like (pdb 1dcp chain B) folds. The observed superfamily frequencies of partial ferredoxin-like SSPs rank higher than that of the right-handed βαβ SSP. Interestingly, SSP 10 is a left-handed ββα and SSP 11 is its mirror, i.e. right-handed. The observed superfamily frequency of SSP 10 is greater than SSP 11 (8.2% vs 6.3%), which supports a previous finding that the left-handed ββα is more prevalent than the right-handed ββα56. Several well-populated folds contain SSP 10 but exclude SSP 11, such as the β-Grasp (ubiquitin-like) and OB-fold. The observed superfamily frequency removes the biases (of observed motif frequency) caused by repeated SSPs in proteins and the existence of redundant homologous proteins. The highly repeated right-handed βαβ and αβα SSPs (SSP 13 and SSP 19, respectively) that appear as outliers in the S3 SSP motif frequency graph are closer to the identity line in the superfamily graph (Figure 3). Thus, superfamily frequency better represents the distribution of SSPs among folds. This justifies our use of superfamily frequencies.
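
The difference between motif counts and superfamily counts discussed above can be illustrated with a small sketch. It assumes that each hit is recorded as an (SSP, superfamily) pair, that the motif count includes every hit, and that the superfamily count includes each superfamily at most once per SSP. The function name `hit_summary`, the labels and the toy data are hypothetical, and the normalization into frequencies is not shown.

```python
from collections import Counter

def hit_summary(hits):
    """Return (motif_counts, superfamily_counts) for a list of hits.

    hits: list of (ssp, superfamily) tuples, one tuple per motif hit.
    motif_counts counts every hit; superfamily_counts counts each
    (ssp, superfamily) combination once, removing repeat-driven redundancy.
    """
    motif_counts = Counter(ssp for ssp, _ in hits)
    superfamily_counts = Counter(ssp for ssp, _ in set(hits))
    return motif_counts, superfamily_counts

# Toy data (hypothetical): a repeated unit concentrated in two superfamilies
# versus a unit that occurs once in each of three superfamilies.
hits = (
    [("bab_right", "sf_rossmann_like")] * 5
    + [("bab_right", "sf_tim_barrel")] * 2
    + [("helix3_right", "sf_a"), ("helix3_right", "sf_b"), ("helix3_right", "sf_c")]
)
motif_counts, superfamily_counts = hit_summary(hits)
print(motif_counts)
print(superfamily_counts)
```

In this toy case the repeated unit dominates the motif counts (7 versus 3) but not the superfamily counts (2 versus 3), mirroring the way repeats inflate motif frequency.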

Right-handed βαβ and ααα SSPs are more prevalent

In general, our results show that mirrors occur at different frequencies. The observed motif frequency of a right-handed βαβ (9.2%) is close to 2.5 times that of a left-handed βαβ (3.7%), which concurs with the preference for right-handed crossover connections in a β-sheet19. However, according to superfamily frequencies, the ratio between right-handed and left-handed βαβ SSPs is smaller than we originally expected, i.e. 1.5 (4.8% vs 3.1%). The relatively high frequency of the left-handed βαβ SSP is explained by the fact that our search allows intervening SSEs in the protein that are not present in the SSP and thus are not a part of the hit. A left-handed βαβ SSP can therefore match many Rossmann-like superfamilies. For example, one is found in the NAD(P)-binding superfamily (e.g. pdb 2yw9, chain B:1-256). The SSP includes the first βα hairpin (residues 9-34) and the initial β-strand after the Rossmann-fold crossover (residues 88-93), with four intervening SSEs that do not belong to the SSP (residues 35-87). If we prohibit intervening SSEs, the ratio between right-handed and left-handed βαβ SSPs increases from 1.5 to 77.7. In this scenario, a left-handed βαβ is only present in three superfamilies: Undecaprenyl diphosphate synthase, Ribosome binding protein Y and Peptidoglycan deacetylase N-terminal noncatalytic region. The observed motif frequency of a right-handed 3-helical bundle (4.7%) is two times higher than that of a left-handed 3-helical bundle (2.3%). The corresponding superfamily frequencies have a ratio of 1.2; however, if we disallow intervening SSEs, then the ratio is 1.7. Our observations agree with a previous finding that the frequency of right-handed 3-helical bundles is
1.6 times higher than that of left-handed 3-helical bundles57. The left-handed 3-helical bundle is not disfavored as much as the left-handed βαβ (1.7-fold vs. 77-fold, respectively), because the right-handed curvature observed in β-sheets58 does not apply to α-helical proteins. A “phone cord effect” hypothesis, where the torque generated during α-helix formation pulls the right-handed contacts together57, was suggested to explain the slight preference of right-handed α-helical connections over the left-handed ones.
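
The effect of allowing or prohibiting intervening SSEs can be sketched as a simple filter on hits. This is a minimal illustration, not the search procedure itself: it assumes a hit is stored as the indices of the matched SSEs within the chain, and the names `intervening_sses` and `keep_hit` as well as the index assignment for the 2yw9 example are hypothetical.

```python
def intervening_sses(hit_indices):
    """Number of SSEs lying between the first and last matched SSE
    that are not themselves part of the hit."""
    ordered = sorted(hit_indices)
    span = ordered[-1] - ordered[0] + 1
    return span - len(ordered)

def keep_hit(hit_indices, allow_intervening=True):
    """Accept every hit by default; when intervening SSEs are prohibited,
    accept only hits whose matched SSEs are consecutive in the chain."""
    return allow_intervening or intervening_sses(hit_indices) == 0

# Hypothetical indexing of the left-handed bab hit described for pdb 2yw9 chain B:
# SSEs 0 and 1 form the first beta-alpha hairpin, SSE 6 is the strand after the
# crossover, and SSEs 2-5 are the four intervening SSEs outside the SSP.
hit = (0, 1, 6)
print(intervening_sses(hit))
print(keep_hit(hit), keep_hit(hit, allow_intervening=False))
```

Under these assumptions the hit is kept in the default mode and discarded when intervening SSEs are prohibited, which is the mechanism behind the change in the right-handed to left-handed ratio described above.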

Analysis of S4 hits

Superfolds are made of more frequent SSPs

The observed superfamily frequencies of SSPs from S4 and S5 reveals that the overrepresented SSPs often originated from superfolds. In S3, we observed an overrepresentation of several SSPs, including the 3-stranded β-meander, the partial ferredoxin-like fold SSPs and the right-handed βαβ unit (SSPs 1, 10, 11, 16 and 13, Figure 3). S4 SSPs that are composed of these S3 SSPs also show high observed superfamily frequencies. For example, the 4-stranded β-meander (SSP 1, Fig. 4) holds the highest observed and expected frequencies. The 4-stranded β-meander represents the repeat unit of the β-propeller fold (4-10 repeats), which exhibits considerable diversity of sequence and function and is found in 32 different superfamilies (10% of the superfamilies found by 4-stranded β-meander). Also similar to S3, partial ferredoxin-like fold SSPs show high observed frequencies, including S4 SSPs 124, 96 and 190 (Fig. 4). For a ferredoxin-like SSP (βαββαβ, i.e. βαβ×2), SSP 124 matches the first four SSEs (βαββαβ, matched SSEs marked by underscores); SSP 96 matches the first β-strand and the second βαβ unit (βαββαβ); and SSP 190 matches the four SSEs in the middle (βαββαβ). In particular, SSPs 124 and 96 are called split βαβ units which are present in 30% of α+β folds43. The tendency of a larger SSP to reflect the trends of its components agrees with an evolutionary strategy that new folds are created by combining existing structural units. In S4, several SSPs formed by extending βαβ SSPs, such as SSP 145, SSP 199 and SSP 146 (Fig.4), are found in Rossmann-like folds with high observed superfamily frequency. SSP 145 (βαβα) and SSP 199 (αβαβ) represent superhelical-like extensions of the right-handed βαβ unit, which are also present in TIM barrels. SSP 146 adds an α-helix to the right-handed βαβ unit on the side of the β-sheet opposite to the existing α-helix. The added α-helix serves as the crossover connection of the Rossmann-like fold59; 60. 
Besides the SCOP Rossmann-like fold (in a narrow sense, c.2 in SCOP), SSP 146 is also present in other superfamilies such as the MTH938-like domain (e.g. pdb 2ab1, chain A:2-122) and the DNA polymerase III psi subunit domain (e.g. pdb 1em8, chain B). S4 SSPs 140 and 142 (Fig. 4) can also be assembled by adding one SSE to the βαβ unit. However, both of these SSPs possess unusual structural features such as crossing loops and left-handedness. SSP 142 adds an α-helix to the right-handed βαβ so that the two helices are adjacent, but the two connections cross over each other. Such a crossing loop is expected to be rare in structures21; 25. However, similar to the right-handed βαβ preference, the observed frequency of SSP
142 is higher than the expected frequency because we allow intervening elements. For example, SSP 142 matches to the first βαβ unit and the C-terminal α-helix of the Rossmann-like fold. If intervening SSE insertions are not allowed, then SSP 142 is unobserved. Similarly, SSP 140 (left-handed βαββ) is also expected to be rare due to the left-handed connection. However, SSP 140 exhibits a high observed frequency, because it also matches the Rossmann-like fold by overlapping with the first βα hairpin, the β-strand after the crossover and the C-terminal α-helix. When intervening SSEs were prohibited, the frequency was zero. Adding any SSE to one side of the 3-stranded β-meander also results in overrepresented SSPs. S4 SSPs 61, 163, 60, and 162 are SSPs with an α-helix packed against the 3-stranded βmeander (Fig. 4). In fact, these SSPs with one α-helix packed against a 3-stranded β-meander have been described as favored SSPs in α+β folds43. In SSP 61, the added α-helix connects to the C-terminus of the β-meander. Typical examples of SSP 61 include β-barrels that are capped with an α-helix, such as PH domain-like and OB fold proteins. The same SSP is also adopted by proteins that bind DNA or RNA, such as the α+β class DNA-binding domain fold, the ssDNAbinding transcriptional regulator domain and the dsRNA-binding domain-like. SSP 60 is the mirror of SSP 61. SSP 60 is present in superfamilies such as the SufE/NifU domain, TATA-box binding protein-like domain and Metallo-hydrolase domain. For SSP 163, the added α-helix connects to the N-terminus of the β-meander. Examples of SSP 163 include the Glyoxalase, Mog1p/PsbP-like domain and the S-adenosylmethionine decarboxylase. Its mirror, SSP 162, is present in superfamilies such as the GAF domain-like protein, Arp2/3 complex domain and MesJ substrate recognition domain-like. By observed superfamily frequency, SSPs 61 and 163 rank higher than SSPs 60 and 162. 
SSPs 61 and 163 can be assembled from S3 SSP 10 by adding an edge β-strand, while SSPs 60 and 162 grow upon S3 SSP 11 in the same way. We noted that S3 SSP 10 ranks higher than S3 SSP 11. Therefore the difference between the observed S4 superfamily frequencies (of SSPs 61 and 163 vs. SSPs 60 and 162) is consistent with the frequencies of their constituent S3 SSPs. The ratio of observed frequency between SSP 61 and SSP 60 is around 1.7. Orengo et al. obtained a similar ratio and indicated that the orientation of SSP 61 is more common in βββα meander SSPs43. If we replace the α-helix in SSPs 61 and 163 with a β-strand, we obtain SSPs 3 and 7 (Fig. 4). The added β-strand is expected to hydrogen-bond to additional β-strands to form another sheet. The largest superfamily containing SSPs 3 and 7 is the Immunoglobulin-like β-sandwich structures. A Greek key SSP is regarded as the signature of Jelly roll and Immunoglobulin folds44. The variants of the Greek key SSP have high observed frequencies. SSP 5 (Fig. 4) resembles the standard Greek key SSP characterized by Richardson4, where the first three β-strands form a β-meander and the fourth β-strand is connected by a long loop (strand order 3214). SSP 4 connects the N-terminal β-strand to a β-meander in the same way (strand order 1432). S4 SSPs 4 and 5 are overrepresented by observed superfamily frequencies (Fig. 4), and are most often found in highly populated β-barrel and β-sandwich folds, such as OB-fold barrel and Immunoglobulin-like beta-sandwich. The presence of a Greek key SSP in an open-faced β-sheet is rare34, although a few cases exist such as in the DNA mismatch repair protein PMS2 (1h7u, B: 118-161, β-strand 3456
of the 8-stranded N-terminal β-sheet). Hutchinson et al. classified Greek key SSPs and demonstrated that four Greek key β-strands are often split into two β-sheets16. S4 SSPs 14, 20 and 26 represent the (2,2) Greek key, (3,1)N Greek key and (3,1)C Greek key defined in Hutchinson’s classification. The notation (2,2) dictates that two β-strands are in one sheet and the remaining two in the other; (3,1) dictates that three β-strands are in one sheet and the last one in the other; ‘N’ means the N-terminal end of the Greek key is on the outside of the SSP and ‘C’ means the C-terminal end of the Greek key is on the outside. The Greek key SSPs from Hutchinson’s work with β-strands across the two sheets are also abundant; they serve as the cross connections in the Immunoglobulin fold and Jelly-roll-like structures. A 4-helical up-down bundle represents one of the most common SSPs61. Three variants of a 4-helical bundle are overrepresented in S4 SSPs, including SSPs 229, 232, and 233 (Fig. 4). They have similar expected and observed frequencies. SSP 233 represents a superhelical-like structure growing upon a 3-helical bundle. SSP 233 can be an independent domain such as the Cag-Z domain. SSPs 229 and 232 are mirrors, where SSP 229 is counterclockwise and SSP 232 is clockwise. Both SSPs can be found as independent folding units, such as the Nickel-containing superoxide dismutase protein for SSP 229 and a Bromodomain-like domain for SSP 232. The observed frequency of the clockwise version (SSP 232) is only slightly larger than that of the counterclockwise version (SSP 229), implying no preference for the swirl of the 4-helical bundle.

Analysis of S5 hits

Rossmann-fold like SSPs rank the highest in S5

The β-meander has the highest frequencies in S3 and S4 SSPs; however, in S5, two Rossmann-like SSPs (SSP 806 and SSP 803, Fig. 5) rank at the very top. SSP 806 represents the smallest version of a Rossmann-fold and has the second highest observed frequency. SSP 803 represents the right-handed βαβαβ unit that is usually found in one half of a larger Rossmann-fold. SSP 803 has the highest observed frequency due to its presence in other α/β twist proteins, such as TIM barrel and SpoIIaa-like domains (βα superhelix structure). Compared to SSP 803 (right-handed βαβαβ), the observed frequency of SSP 1049 (right-handed αβαβα) is lower. A potential reason for this discrepancy is that SSP 1049 requires the presence of additional β-strands to stabilize its three α-helices and thus it is less independent than SSP 803 in α/β proteins. Therefore, it is more difficult to find SSP 1049 within α/β proteins with a smaller number of parallel β-strands such as the barstar-related proteins with βαβαβ SSE arrangement (e.g. pdb 1b2s chain E). Another example is the 3-layer flavoproteins with 5 parallel β-strands ordered 21345, where the second half of the protein (β-strands 345) can accommodate SSP 803 but not SSP 1049. There are two SSPs (SSP 793 and SSP 804, Fig. 5) with uncommon structural features (left-handedness and crossing loop). Nevertheless, they have relatively high observed frequencies due to their presence in the Rossmann-fold. SSP 793 represents a βαβαβ unit and the first βαβ is left-handed in the diagram. Since we do not limit SSE insertions between constituent SSEs, SSP 793 can overlap with the first βα hairpin of a Rossmann-fold and the βαβ unit after the Rossmann
crossover. In SSP 804, the two connections cross over each other. However, when SSE insertion is allowed, SSP 804 can match the first βαβα unit of the Rossmann crossover and the initial β-strand after the Rossmann crossover. The ferredoxin-like superfold contains several S5 SSPs (SSP 784 and SSP 567, Fig. 5). SSP 784 represents a partial ferredoxin-like fold (βαββαβ) and ranks third in observed frequency. SSP 567 can match a permutation of a ferredoxin-like fold that starts from the second β-strand and connects the last β-strand to the first β-strand (e.g. C-terminal domain of arginine repressor). The S4 ferredoxin-like SSPs have higher observed frequencies than S4 Rossmann-like SSPs, whereas S5 ferredoxin-like SSPs rank lower than S5 Rossmann-like SSPs. In our opinion, the smaller S4 ferredoxin-like SSPs are frequent in the circular permutations of ferredoxin-like folds; however, when an SSE is added to yield the corresponding S5 SSP, the added SSE prevents the match in various permutations of the ferredoxin-like fold.

5-stranded Greek key SSPs are overrepresented in S5

A previous study surveyed 5-stranded Greek key SSPs in β-barrel and β-sandwich structures35, and differentiated the spatial variants of the 5-stranded Greek key SSPs based on the distribution of β-strands into two sheets. Several overrepresented S5 SSPs resemble 5-stranded Greek key SSPs. SSP 78 corresponds to the previously defined spatial form a0b0b1c0d0, where the first two and last two β-strands are termed a0, b0, c0 and d0 and the middle β-strand is termed b1 since the middle β-strand hydrogen-bonds to b0; SSP 142 corresponds to spatial form a0b0c1c0d0, since the middle β-strand hydrogen-bonds to c0. These two SSPs are usually located at the edge of βsandwich structures such as Carbohydrate-binding domain (SSP 78, pdb 2jh2 A:39-103) and thaumatin-like protein (SSP 142, pdb 1aun, chain A:2-51)62. The observed frequencies of SSPs 78 and 142 are similar, suggesting no preference for these two spatial forms. SSP 140 is also related to a Greek key SSP, which can be formed by adding one β-strand to the C-terminus of (3,1)C Greek key SSP (SSP 26 of S4). SSP 140 is present in the Immunoglobulin-like betasandwich fold.

DISCUSSION

Split β-sheets are underrepresented

SSPs with neighboring β-strands in sequence that are not hydrogen-bonded in structure are termed split β-sheets and are usually underrepresented. The β-strand arrangements of S3 SSPs 8 and 9 (Fig. 3) are known as psi-loops63. Each of the S3 psi-loop SSPs can only be assembled via one combination of smaller SSPs. Therefore, the expected frequency is half that of the 3-stranded β-meander. SSPs 8 and 9 are underrepresented. Their low observed frequencies may be explained by their relatively high local contact order, which has been suggested to result in significantly longer folding time64; 65. Moreover, the two outer psi-loop β-strands are connected by a loop crossing over the middle β-strand, which further raises the folding difficulty21. We observed only a few domains with psi-loop SSPs representing the complete β-sheet of the structure
core, such as the urocanase N-terminal domain, penicillin-binding protein 2x, and the HSP90 C-terminal domain. Larger β-sheets that extend the psi-loop, including the outlier S4 SSPs 43 and 52 (Fig. 4) and S5 SSPs 228, 287, and 316 (Fig. 5), are also underrepresented. There are additional ways to extend the psi-loop for S4 (and S5), such as adding a β-strand or α-helix on either side. For instance, when adding the α-helix to the N-terminus of a psi-loop with β-strand order 213, where the handedness of the α-helix and the first two β-strands is right-handed, we obtain an SSP that is present in the mycothiol-dependent maleylpyruvate isomerase C-terminal domain-like superfamily with an αβ(2)(crossover)β(α)β topology and β-strand order 2134 (pdb 2nsf chain A:162-240). In contrast, when adding an α-helix to the opposite side, the SSP is found in the N-terminal domain of urocanase, with the topology of α(2)β(3), mixed β-sheet 213 (pdb 1uwl, chain B:61-106). The observed frequencies of such psi-loop extensions are low, which can be explained by their correspondingly low expected frequencies. Comparing the outlier psi-loop SSPs (43 and 52) with the other psi-loop extensions that are crowded in the lower corner of the graph (Figure 4), the β-sheets for the outliers can be assembled via two combinations of smaller SSPs. For instance, either adding a β-strand to the N-terminus of the psi-loop or combining two β-hairpins yields S4 SSP 43. However, the other extensions can be assembled in only one way. The S3 SSPs 7 and 4 (Fig. 3) represent two SSPs with a β-hairpin connected by a non-hydrogen-bonded β-strand. They rank at the very bottom in S3 observed frequency (Fig. 3), which is consistent with their low expected frequencies. The expected frequencies of SSPs 7 and 4 are low because these SSPs can only be formed by adding one β-strand to two unpaired β-strands, which has the lowest observed motif frequency in S2.
The preference of local hydrogenbonding23 may explain the low observed frequency, since it is hard for SSP 7 and SSP 4 to compete with a β-meander in the folding process due to a lower number of hydrogen-bonding contacts. For SSPs 7 and 4, an additional β-strand is needed to stabilize the unpaired middle βstrand. Consequently, these SSPs are found in β-superhelical structures, where each turn is comprised of three β-strands with the middle β-strand hydrogen-bonded to the i+2 β-strand from the neighboring turn. In addition to infrequent psi-loops, S4 contains two other known types of split β-sheets, which have been termed pretzels (β-sheet in strand order 3142 and 2413) and spirals (β-sheet in strand order 1342 or 3124)6. Similar to previous studies that found pretzels and spirals to be absent from the structure database6, we do not observe these SSPs if we require the four βstrands to be adjacent in the primary protein sequence. However, if SSE insertions are allowed (our default mode), some hits occur. A pretzel of strand order 4132 is found in glyoxalase domains (e.g. pdb 1kmz chain A:3-79), where an αβ insertion is present between β-strand 1 and 2 and a loop connects β-strand 3 and 4. A spiral with strand order 1342 is found in the phosphotransferase/anion transport protein (e.g. pdb 1a3a chain A:11-106) with two helices connecting the first and the second β-strands.
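
The notion of a split β-sheet can be made concrete by counting, for a given spatial strand order, the sequence-neighboring strand pairs that are not spatial neighbors in the sheet. The sketch below is an illustration under stated assumptions: the sheet is open (not a closed barrel), strands are labeled with single digits in sequence order, and a pair counts as split when the two strands are not adjacent in the strand-order string. The function name `count_split_pairs` is hypothetical.

```python
def count_split_pairs(strand_order):
    """Count sequence-consecutive strand pairs that are not adjacent in the sheet.

    strand_order: spatial order of strands across an open sheet, written with
    sequence numbers, e.g. "213" for a psi-loop arrangement.
    """
    position = {int(label): i for i, label in enumerate(strand_order)}
    n = len(strand_order)
    return sum(
        1 for k in range(1, n) if abs(position[k] - position[k + 1]) != 1
    )

# Strand orders quoted in the text for each arrangement.
examples = {
    "beta-meander 123": "123",
    "psi-loop 213": "213",
    "pretzel 3142": "3142",
    "pretzel 2413": "2413",
    "spiral 1342": "1342",
    "spiral 3124": "3124",
}
for name, order in examples.items():
    print(name, count_split_pairs(order))
```

Under this counting, the β-meander has no split pairs, the psi-loop has one, the spirals have two and the pretzels have three.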

Unobserved SSPs

The number of protein folds is limited due to topology preference, geometry regularities of SSE packing and chain topology66. Several general principles that govern the protein folding were summarized by Chothia42 and Taylor21. Their principles include: 1) right-handed connections are prevalent in βXβ units, where the two β-strands are hydrogen-bonded and X can be α-helix, βstrand or a longer connection comprised of several SSEs; 2) connections between SSEs tend not to cross each other or make knots; and 3) larger numbers of jumps6 (the number of β-strand pairs that are split in a β-sheet) are unfavorable. Our search found all SSPs from S1 to S4 to have at least one motif hit in the PDB. However, 218 S5 SSPs did not have a single motif hit in the current PDB (unobserved SSPs). These unobserved SSPs highlight an interesting theoretical question about fold space: are SSPs that are absent in nature able to fold? Among the unobserved SSPs, 155 have unpaired β-strand(s) and are thereby not considered as isolated domains, since they require additional elements to fold. Therefore, we focus on outlining the properties of the remaining 63 unobserved SSPs that could potentially serve as protein design targets. To characterize these 63 SSPs, we first grouped them by SSE arrangement and topology. We then investigated the violations of general protein folding rules for each of the resulting six groups. The diagrams of these SSPs are shown in Fig. 6. The eight unobserved 5-stranded β-sheets (Fig. 6, panels 1-8) all contain psi-loops. 6 Spirals (β-sheet in strand order 1342 or 3124) are also present in the SSPs shown in panel 1, 7 and 8 and pretzels20 (β-sheet in strand order 3142 and 2413) are in the SSPs shown in panel 3, 4, 5 and 6. The remaining SSP shown in panel 2 is similar to a pretzel, where the connection between β-strand 1 and 2 collides with the connection between β-strand 3 and 4. 
We counted the number of jumps (split connections) in each β-sheet as per Ruczinski et al.6. Out of the eight unobserved β-sheets, two contain two jumps and six contain three jumps. In comparison, the frequently observed β-sheet SSPs usually have a low number of jumps (e.g. the β-meander has zero and the standard Greek key has one). Though these rules generally hold true for β-sheet-containing folds, exceptional cases were observed. For instance, we found five motif hits for a β-sheet with strand order 13542 and ↑↑↑↓↓ topology. This SSP has three jumps as well as a spiral formed by β-strands 2-5. The connection between β-strands 1 and 2 collides with the connection between β-strands 3 and 4. This SSP is present in the erythrocyte membrane band 3 domain (pdb 1hyn, chain S), where both split connections contain inserted SSEs. This exceptional case implies we might eventually find the eight unobserved 5-stranded β-sheets in nature as the size of the library grows. We identified 26 unobserved SSPs with β-sandwich topologies (three β-strands in one sheet and two β-strands in another sheet). Each of the 26 SSPs (Fig. 6, panels 9-34) contains a rare left-handed βββ unit. Crossing loops are also present in all 26 SSPs. For instance, in panel 9, the connection between the second β-strand and the third as well as the connection between the fourth and the fifth form a crossing loop. Nine unobserved SSPs have a 4-stranded β-sheet flanked by an α-helix (Fig. 6, panels 35-43). Eight of these SSPs (Fig. 6, panels 36-43) have two jumps in the β-sheet. Among these eight
SSPs, five contain a left-handed βαβ unit (Fig. 6, panels 37-39 and panels 42-43). The remaining SSP, which contains one jump, is shown in panel 35. However, this SSP also includes a rare left-handed βαβ unit. Interestingly, its mirror (which preserves the strand order but has a right-handed βαβ connection) is present (in tryptophan hydroxylase, 1mhwA, 318-376) and the long connection between the second and third β-strands is formed by helices. The combination of jumps and left-handedness is present in six out of nine unobserved SSPs. Four unobserved SSPs form a 3-stranded β-sheet flanked by an α-helix on each side (Fig. 6, panels 44-47). Three of them contain crossing loops (Fig. 6, panels 45-47) and one has two left-handed βαβ units (panel 44). Eight unobserved SSPs form a 3-stranded β-sheet flanked by two helices on one side (Fig. 6, panels 48-55). Among them, seven SSPs (Fig. 6, panels 48-52 and panels 54-55) have both crossing loops and left-handed βαβ units. We also noted eight unobserved SSPs with a 2-stranded β-sheet flanked by three helices on one side (Fig. 6, panels 56-63). All of these SSPs have crossing loops, and five of them (panels 59-63) have left-handed units. In summary, unobserved SSPs frequently contain unfavorable structural features that include crossing loops, left-handed connections and split β-strand connections. The unobserved S5 SSPs often contain two or more unfavorable features (Table 5). Of the 218 unobserved S5 SSPs, 63 have no unpaired β-strand. Among these 63 SSPs, 30 (48%) contain both a jump and a crossing loop, 34 (54%) contain both a jump and a left-handed βXβ unit and 39 (62%) contain both a crossing loop and a left-handed βXβ unit. Only 14 SSPs out of 63 (22%) contain exactly one unfavorable feature; these SSPs might be more likely to occur in proteins. Note that 14 is only 6.4% of 218.

Limitations

We sought to enumerate the globular compact packing of linear SSEs in proteins, with the goal of establishing the maximal coverage of the existing folds while limiting the total number of enumerated SSPs and the search time (of SSPs in PDB). To accommodate these criteria, several constraints were built into our SSP generation and search methods. These constraints include: 1) requiring consecutive SSEs to travel in an up-down topology, which reflects the fact that two neighboring parallel (as opposed to up-down) SSEs are much more frequently connected via a helix or a strand rather than a loop, and such connecting loops, if present, may be deteriorated or partly unfolded helices and β-strands; 2) rejecting SSPs with three or more coplanar helices, which are infrequent in proteins due to irregular arrangement of helices, some of which usually deviate from the plane; 3) favoring antiparallel/parallel interactions between SSEs over orthogonal packing, where we consider two SSEs to be orthogonal if the angle between them lies in the narrow range of 85°≤ϕ<95°; 4) searching for SSPs among PDB chains, which ignores motif hits formed by multimers and reduces the search space; 5) imposing a minimum length for an α-helix (8 residues) and a β-strand (5 residues) in order to focus on regular and long SSEs, which minimizes the effect of SSE delineation errors and the influence of irregular structures on
ACCEPTED MANUSCRIPT

AC CE P

TE

D

MA

NU

SC

RI

PT

the results; and 6) excluding the compositions of non-compact SSPs to yield compact SSPs. Although these constraints may exclude certain SSP hits, as reflected by the PDB coverage (Table 1), our SSPs identify the vast majority of structures (S5 coverage 87.7%). To illustrate the limitations of these constraints, we show several structures that exist in nature but were not identified by S3 SSPs. The HMG-box domain (pdb 1ckt, chain A) represents a violation of the up-down topology requirement (Fig. 7a). The HMG-box domain consists of three helices in an up-down-down topology, with the first α-helix pointing up (shown in blue), followed by two helices pointing down (shown in green and red). A similar example can be found in the all-β structure of the SARS ORF9b-like domain (pdb 2cme, chain B; Fig. 7b), in which the three β-strands are arranged in strand order 123, with up-down-down topology. Moreover, the α-helix (shown in orange) is shorter than eight residues. Therefore, the SSP formed by the blue and green β-strands and the orange α-helix cannot be identified. Note that this protein chain forms a dimer with another chain in the PDB through a cross-chain β-sheet. An example of coplanar helices is shown in Fig. 7c, which depicts a 3-helical GRIP domain (pdb 1upt, chain B). This 3-helical GRIP domain requires a homodimeric interaction for stability. The complete protein is formed by two chains, with each contributing a collinear 3-helical domain. The GRIP domain similarly highlights our omission of eligible S3 SSPs that could be identified across chains. An example of a domain that is omitted by S3 due to orthogonal packing is shown in Fig. 7d, where the second α-helix (shown in green) in the IscX domain (pdb 1uj8, chain A, helix-turn-helix) is perpendicular to the first α-helix (shown in blue). An example of a protein omitted due to its short SSEs is shown in Fig. 7e, which is an orphan nuclear receptor Rev-erb DNA-binding domain (pdb 1a6y, chain A) with two short (<5 residues) β-strands (shown in blue). We also observed multidomain proteins that are missing among our SSP hits. One such example is illustrated in Fig. 7f, which is a chain of the viral NSP3 homodimer protein (pdb 1knz, chain A) that contains two SCOP domains. The first five helices form an all-α domain, and the rest of the SSEs form an α+β domain. In the all-α domain, only the blue, green and yellow helices may form an up-down-up SSP. However, the yellow α-helix cannot interact with the other two helices (it is too far apart); thus, such an SSP is infeasible. Similarly, in the α+β domain, the two β-strands and the following α-helix may form an SSP with up-down-up topology. However, there is no interaction between the second β-strand (shown in orange) and the α-helix (shown in light red) per our distance calculation; thus, such an SSP is infeasible.
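Two of the constraints discussed above, the minimum SSE lengths and the orthogonality window of 85°≤ϕ<95°, can be expressed as simple filters. The following Python sketch is an illustration under stated assumptions: the `SSE` record, its fields and the function names are hypothetical and do not reproduce the authors' implementation; only the numeric thresholds come from the text.

```python
import math
from dataclasses import dataclass

MIN_HELIX_LEN = 8    # residues (constraint 5 in the text)
MIN_STRAND_LEN = 5   # residues (constraint 5 in the text)


@dataclass
class SSE:
    """Hypothetical record for one linear secondary structure element."""
    kind: str      # "H" for α-helix, "E" for β-strand
    length: int    # number of residues
    axis: tuple    # 3D direction vector of the element (assumed representation)


def is_long_enough(sse: SSE) -> bool:
    """Apply the minimum-length rule for helices and strands."""
    minimum = MIN_HELIX_LEN if sse.kind == "H" else MIN_STRAND_LEN
    return sse.length >= minimum


def angle_deg(u, v) -> float:
    """Angle between two direction vectors, in degrees (0 to 180)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cos_phi = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos_phi))


def is_orthogonal(a: SSE, b: SSE) -> bool:
    """True when the inter-axis angle falls in the window 85° <= phi < 95°."""
    return 85.0 <= angle_deg(a.axis, b.axis) < 95.0


# Toy elements: a long helix, a short perpendicular helix, an antiparallel strand.
helix1 = SSE("H", 12, (0.0, 0.0, 1.0))
helix2 = SSE("H", 7, (1.0, 0.0, 0.0))
strand = SSE("E", 5, (0.0, 0.0, -1.0))
```

In this toy setup, `helix2` fails the length filter and is orthogonal to `helix1`, which loosely mirrors the short-helix and IscX examples described in the text.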

CONCLUSIONS

The analysis of super-secondary structure patterns (SSPs)9; 22; 28; 67; 68; 69 has contributed to structure prediction and protein design. We present a novel enumeration of globular compact SSPs composed of up to five SSEs (helices or strands). Our enumeration model builds larger SSPs by combining two smaller SSPs, which emulates the idea of fold evolution, i.e. the combination of smaller structures to form larger ones. A significant proportion of existing proteins in the PDB
and SCOP is covered by at least one SSP. Our search results will aid structure classification, because they provide lists of proteins containing a specific SSP as a seed for more detailed structure analysis. This strategy of starting the search from a core SSP was previously employed in classifying Thioredoxin-like folds70. The analysis of the distribution of SSPs in SCOP superfamilies demonstrates that overrepresented SSPs are present in superfolds, such as the Rossmann-like fold, ferredoxin-like fold and Greek key motif. This agrees with the previous finding that a superfold is made of frequent supersecondary structures13. The rare SSPs exhibit uncommon structural features, e.g. split β-sheet connections, left-handed connections and crossing loops. We identified several SSPs, such as the pretzel β-sheet and spiral β-sheet, which to our knowledge had not previously been described in protein structures. Moreover, all possible SSPs composed of two through four SSEs are present in the PDB, whereas 218 SSPs with five SSEs are unobserved in proteins. Among the 63 unobserved SSPs without unpaired β-strands, 49 (i.e. 78%) contain two or more unfavorable features, indicating high co-occurrence of uncommon features. The remaining 14 SSPs with only one unfavorable feature are of interest. We provide a library of SSPs from which one can select SSPs based on their observed and expected frequencies to design proteins, including those with novel folds.

Website

We tabulate the SSP search results online at http://prodata.swmed.edu/ssps/. This website provides a table that is sortable by SSP identity, the number of motif hits and the number of superfamily hits, along with the corresponding observed and expected frequencies. Links corresponding to the number of motif hits will navigate the user to an individual page that displays more detailed search results, such as the SCOP domain id, range of each motif hit and the corresponding superfamily assignment. The motif hits can be visualized with PyMOL linked to the motif hit range.

ACKNOWLEDGEMENTS

This work was supported in part by the National Institutes of Health (GM094575 to NVG) and the Welch Foundation (I-1505 to NVG). Daniel Parente contributed to the coding of the initial model. Hua Cheng scrutinized some of the motif hits. Raquel Bromberg assisted with the final reading.

REFERENCES

1. Brennan, R. G. & Matthews, B. W. (1989). The helix-turn-helix DNA binding motif. J Biol Chem 264, 1903-6.
2. Taylor, W. R. & Thornton, J. M. (1984). Recognition of super-secondary structure in proteins. J Mol Biol 173, 487-512.
3. Richardson, J. S. (1977). beta-Sheet topology and the relatedness of proteins. Nature 268, 495-500.
4. Richardson, J. S. (1981). The anatomy and taxonomy of protein structure. Adv Protein Chem 34, 167-339.
5. Efimov, A. V. (2013). Super-secondary structures and modeling of protein folds. Methods Mol Biol 932, 177-89.
6. Ruczinski, I., Kooperberg, C., Bonneau, R. & Baker, D. (2002). Distributions of beta sheets in proteins with application to structure prediction. Proteins 48, 85-97.
7. Sibanda, B. L., Blundell, T. L. & Thornton, J. M. (1989). Conformation of beta-hairpins in protein structures. A systematic classification with applications to modelling by homology, electron density fitting and protein engineering. J Mol Biol 206, 759-77.
8. Menon, V., Vallat, B. K., Dybas, J. M. & Fiser, A. (2013). Modeling proteins using a super-secondary structure library and NMR chemical shift information. Structure 21, 891-9.
9. Jones, D. T. & McGuffin, L. J. (2003). Assembling novel protein folds from super-secondary structural fragments. Proteins 53 Suppl 6, 480-5.
10. Burroughs, A. M., Balaji, S., Iyer, L. M. & Aravind, L. (2007). A novel superfamily containing the beta-grasp fold involved in binding diverse soluble ligands. Biol Direct 2, 4.
11. Krishna, S. S., Majumdar, I. & Grishin, N. V. (2003). Structural classification of zinc fingers: survey and summary. Nucleic Acids Res 31, 532-50.
12. Lupas, A. N., Ponting, C. P. & Russell, R. B. (2001). On the evolution of protein folds: are similar motifs in different protein folds the result of convergence, insertion, or relics of an ancient peptide world? J Struct Biol 134, 191-203.
13. Salem, G. M., Hutchinson, E. G., Orengo, C. A. & Thornton, J. M. (1999). Correlation of observed fold frequency with the occurrence of local structural motifs. J Mol Biol 287, 969-81.
14. Rao, S. T. & Rossmann, M. G. (1973). Comparison of super-secondary structures in proteins. J Mol Biol 76, 241-56.
15. Burroughs, A. M., Balaji, S., Iyer, L. M. & Aravind, L. (2007). Small but versatile: the extraordinary functional and structural diversity of the beta-grasp fold. Biol Direct 2, 18.
16. Hutchinson, E. G. & Thornton, J. M. (1993). The Greek key motif: extraction, classification and analysis. Protein Eng 6, 233-45.
17. Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B. & Thornton, J. M. (1997). CATH--a hierarchic classification of protein domain structures. Structure 5, 1093-108.
18. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247, 536-40.
19. Richardson, J. S. (1976). Handedness of crossover connections in beta sheets. Proc Natl Acad Sci U S A 73, 2619-23.
20. Cohen, F. E., Sternberg, M. J. & Taylor, W. R. (1982). Analysis and prediction of the packing of alpha-helices against a beta-sheet in the tertiary structure of globular proteins. J Mol Biol 156, 821-62.
21. Grainger, B., Sadowski, M. I. & Taylor, W. R. (2010). Re-evaluating the "rules" of protein topology. J Comput Biol 17, 1371-84.
22. Koga, N., Tatsumi-Koga, R., Liu, G., Xiao, R., Acton, T. B., Montelione, G. T. & Baker, D. (2012). Principles for designing ideal protein structures. Nature 491, 222-7.
23. Baldwin, R. L. & Rose, G. D. (1999). Is protein folding hierarchic? I. Local structure and peptide folding. Trends Biochem Sci 24, 26-33.
24. Baldwin, R. L. & Rose, G. D. (1999). Is protein folding hierarchic? II. Folding intermediates and transition states. Trends Biochem Sci 24, 77-83.
25. Efimov, A. V. (1994). Favoured structural motifs in globular proteins. Structure 2, 999-1002.
26. Baker, D. (2000). A surprising simplicity to protein folding. Nature 405, 39-42.
27. Bystroff, C. & Shao, Y. (2002). Fully automated ab initio protein structure prediction using I-SITES, HMMSTR and ROSETTA. Bioinformatics 18 Suppl 1, S54-61.
28. Fernandez-Fuentes, N., Oliva, B. & Fiser, A. (2006). A supersecondary structure library and search algorithm for modeling loops in protein structures. Nucleic Acids Res 34, 2085-97.
29. Fernandez-Fuentes, N., Zhai, J. & Fiser, A. (2006). ArchPRED: a template based loop structure prediction server. Nucleic Acids Res 34, W173-6.
30. Janin, J. & Chothia, C. (1980). Packing of alpha-helices onto beta-pleated sheets and the anatomy of alpha/beta proteins. J Mol Biol 143, 95-128.
31. Chothia, C. & Janin, J. (1981). Relative orientation of close-packed beta-pleated sheets in proteins. Proc Natl Acad Sci U S A 78, 4146-50.
32. Chothia, C. & Janin, J. (1982). Orthogonal packing of beta-pleated sheets in proteins. Biochemistry 21, 3955-65.
33. Murzin, A. G. & Finkelstein, A. V. (1988). General architecture of the alpha-helical globule. J Mol Biol 204, 749-69.
34. Zhang, C. & Kim, S. H. (2000). The anatomy of protein beta-sheet topology. J Mol Biol 299, 1075-89.
35. Zhang, C. & Kim, S. H. (2000). A comprehensive analysis of the Greek key motifs in protein beta-barrels and beta-sandwiches. Proteins 40, 409-19.
36. Chiang, Y. S., Gelfand, T. I., Kister, A. E. & Gelfand, I. M. (2007). New classification of supersecondary structures of sandwich-like proteins uncovers strict patterns of strand assemblage. Proteins 68, 915-21.
37. Woolfson, D. N., Evans, P. A., Hutchinson, E. G. & Thornton, J. M. (1993). Topological and stereochemical restrictions in beta-sandwich protein structures. Protein Eng 6, 461-70.
38. Fokas, A. S., Gelfand, I. M. & Kister, A. E. (2004). Prediction of the structural motifs of sandwich proteins. Proc Natl Acad Sci U S A 101, 16780-3.
39. Fokas, A. S., Papatheodorou, T. S., Kister, A. E. & Gelfand, I. M. (2005). A geometric construction determines all permissible strand arrangements of sandwich proteins. Proc Natl Acad Sci U S A 102, 15851-3.
40. Papatheodorou, T. S. & Fokas, A. S. (2009). Systematic construction and prediction of the arrangement of the strands of sandwich proteins. J R Soc Interface 6, 63-73.
41. Efimov, A. V. (1997). Structural trees for protein superfamilies. Proteins 28, 241-60.
42. Chothia, C. & Finkelstein, A. V. (1990). The classification and origins of protein folding patterns. Annu Rev Biochem 59, 1007-39.
43. Orengo, C. A. & Thornton, J. M. (1993). Alpha plus beta folds revisited: some favoured motifs. Structure 1, 105-20.
44. Bork, P., Holm, L. & Sander, C. (1994). The immunoglobulin fold. Structural classification, sequence patterns and common core. J Mol Biol 242, 309-20.
45. Thornton, J. M., Orengo, C. A., Todd, A. E. & Pearl, F. M. (1999). Protein folds, functions and evolution. J Mol Biol 293, 333-42.
46. Orengo, C. A., Jones, D. T. & Thornton, J. M. (1994). Protein superfamilies and domain superfolds. Nature 372, 631-4.
47. Shi, S., Zhong, Y., Majumdar, I., Sri Krishna, S. & Grishin, N. V. (2007). Searching for three-dimensional secondary structural patterns in proteins with ProSMoS. Bioinformatics 23, 1331-8.
48. Majumdar, I., Krishna, S. S. & Grishin, N. V. (2005). PALSSE: a program to delineate linear secondary structural elements from protein structures. BMC Bioinformatics 6, 202.
49. Shi, S., Chitturi, B. & Grishin, N. V. (2009). ProSMoS server: a pattern-based search using interaction matrix representation of protein structures. Nucleic Acids Res 37, W526-31.
50. Soding, J. & Lupas, A. N. (2003). More than the sum of their parts: on the evolution of proteins from peptides. Bioessays 25, 837-46.
51. Kinch, L. N. & Grishin, N. V. (2002). Evolution of protein structures and functions. Curr Opin Struct Biol 12, 400-8.
52. Peng, K., Obradovic, Z. & Vucetic, S. (2004). Exploring bias in the Protein Data Bank using contrast classifiers. Pac Symp Biocomput, 435-46.
53. Sibanda, B. L. & Thornton, J. M. (1985). Beta-hairpin families in globular proteins. Nature 316, 170-4.
54. Caetano-Anolles, G. & Caetano-Anolles, D. (2003). An evolutionarily structured universe of protein architecture. Genome Res 13, 1563-71.
55. Grishin, N. V. (2001). Fold change in evolution of protein structures. J Struct Biol 134, 167-85.
56. Kajava, A. V. (1992). Left-handed topology of super-secondary structure formed by aligned alpha-helix and beta-hairpin. FEBS Lett 302, 8-10.
57. Cole, B. J. & Bystroff, C. (2009). Alpha helical crossovers favor right-handed supersecondary structures by kinetic trapping: the phone cord effect in protein folding. Protein Sci 18, 1602-8.
58. Chou, K. C., Nemethy, G., Pottle, M. & Scheraga, H. A. (1989). Energy of stabilization of the right-handed beta alpha beta crossover in proteins. J Mol Biol 205, 241-9.
59. Dym, O. & Eisenberg, D. (2001). Sequence-structure analysis of FAD-containing proteins. Protein Sci 10, 1712-28.
60. Rossmann, M. G., Moras, D. & Olsen, K. W. (1974). Chemical and biological evolution of nucleotide-binding protein. Nature 250, 194-9.
61. Presnell, S. R. & Cohen, F. E. (1989). Topological distribution of four-alpha-helix bundles. Proc Natl Acad Sci U S A 86, 6592-6.
62. Efimov, A. V. (1982). [Super-secondary structure of beta-proteins]. Mol Biol (Mosk) 16, 799-806.
63. Hutchinson, E. G. & Thornton, J. M. (1996). PROMOTIF--a program to identify and analyze structural motifs in proteins. Protein Sci 5, 212-20.
64. Plaxco, K. W., Simons, K. T. & Baker, D. (1998). Contact order, transition state placement and the refolding rates of single domain proteins. J Mol Biol 277, 985-94.
65. Bonneau, R., Ruczinski, I., Tsai, J. & Baker, D. (2002). Contact order and ab initio protein structure prediction. Protein Sci 11, 1937-44.
66. Govindarajan, S., Recabarren, R. & Goldstein, R. A. (1999). Estimating the total number of protein folds. Proteins 35, 408-14.
67. Gerstman, B. S. & Chapagain, P. P. (2013). Computational simulations of protein folding to engineer amino acid sequences to encourage desired supersecondary structure formation. Methods Mol Biol 932, 191-204.
68. Sborgi, L., Verma, A., Sadqi, M., de Alba, E. & Munoz, V. (2013). Protein folding at atomic resolution: analysis of autonomously folding supersecondary structure motifs by nuclear magnetic resonance. Methods Mol Biol 932, 205-18.
69. Pellegrini-Calace, M., Carotti, A. & Jones, D. T. (2003). Folding in lipid membranes (FILM): a novel method for the prediction of small membrane protein 3D structures. Proteins 50, 537-45.
70. Qi, Y. & Grishin, N. V. (2005). Structural classification of thioredoxin-like fold proteins. Proteins 58, 376-88.

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5

Figure 6

Figure 7

Graphical abstract

Table 1. Enumerated SSPs and their occurrence in PDB

| SSP size¹ | Total number of SSPs² | Unobserved SSPs³ | Percentage of pdbs⁴ | Percentage of SCOP domains⁵ | Percentage of pdbs with ≥50% SSEs covered by SSPs⁶ | Percentage of SCOP superfamilies⁷ |
|---|---|---|---|---|---|---|
| 2 | 5 | 0 | 99.5% | 99.0% | 99.1% | 97.5% |
| 3 | 23 | 0 | 97.0% | 95.2% | 92.1% | 90.6% |
| 4 | 221 | 0 | 92.9% | 90.0% | 85.6% | 83.5% |
| 5 | 1239 | 218 (18%) | 87.7% | 83.9% | 71.8% | 73.1% |

¹ The number of constituent SSEs defining the SSP.
² The number of SSPs generated by our enumeration model.
³ The number of SSPs that are absent in the PDB.
⁴ Given a set of SSPs with size n, the number of pdbs found by at least one SSP divided by the total number of pdbs that have at least n SSEs.
⁵ Given a set of SSPs with size n, the number of SCOP domains found by at least one SSP divided by the total number of SCOP domains that have at least n SSEs.
⁶ The number of pdbs that have over 50% of their SSEs matched divided by the total number of pdbs found.
⁷ The number of SCOP superfamilies that have half or more of their pdbs covered by at least one SSP divided by the total number of superfamilies (that contain at least one SSP of the given SSP set).

Table 2. The distribution of α-helix and β-strand in four major SCOP classes.

| SCOP class | Number of superfamilies | Number of superfamilies containing α-helix | Ratio* of α-helix present in SCOP class | Number of superfamilies containing β-strand | Ratio* of β-strand present in SCOP class |
|---|---|---|---|---|---|
| all-α | 507 | 507 | 1.00 | 65 | 0.13 |
| all-β | 354 | 221 | 0.62 | 352 | 0.99 |
| α/β | 244 | 244 | 1.00 | 243 | 1.00 |
| α+β | 552 | 546 | 0.99 | 545 | 0.99 |

* Ratios are obtained by dividing the number of superfamilies containing α-helix (β-strand) by the total number of superfamilies in a SCOP class.

Table 3. S2 observed and expected frequencies.

| SSP | Nonspecific expected frequency¹ | Specific expected motif frequency² | Observed motif frequency | Specific expected superfamily frequency² | Observed superfamily frequency |
|---|---|---|---|---|---|
| EE (H-bonded) | 12.5% | 17.8% | 22.7%* | 14.6% | 24.7%* |
| HH | 25% | 16.2% | 15.7%* | 21.1% | 23.7% |
| EH | 25% | 24.1% | 25.0% | 24.8% | 20.6%* |
| HE | 25% | 24.1% | 22.5%* | 24.8% | 20.6%* |
| EE (not H-bonded) | 12.5% | 17.8% | 14.1% | 14.6% | 10.5% |

¹ Calculated using purely theoretical expected frequencies, which do not rely on the observed frequency and apply to both motif and superfamily.
² Calculated using observation-based expected frequencies.
* SSPs with a smaller difference between the specific expected frequency and the observed frequency than between the nonspecific expected frequency and the observed frequency.
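The asterisk rule in the footnote of Table 3 amounts to comparing two absolute differences. The Python sketch below applies that comparison to the motif and superfamily frequencies listed in Table 3; the data layout and function name are illustrative assumptions, and only the percentages come from the table.

```python
# Table 3 values in percent:
# (nonspecific expected, specific expected, observed)
MOTIF = {
    "EE (H-bonded)":     (12.5, 17.8, 22.7),
    "HH":                (25.0, 16.2, 15.7),
    "EH":                (25.0, 24.1, 25.0),
    "HE":                (25.0, 24.1, 22.5),
    "EE (not H-bonded)": (12.5, 17.8, 14.1),
}
SUPERFAMILY = {
    "EE (H-bonded)":     (12.5, 14.6, 24.7),
    "HH":                (25.0, 21.1, 23.7),
    "EH":                (25.0, 24.8, 20.6),
    "HE":                (25.0, 24.8, 20.6),
    "EE (not H-bonded)": (12.5, 14.6, 10.5),
}


def starred(table):
    """Return SSPs whose specific expectation is closer to the observation
    than the nonspecific expectation is (the asterisk rule)."""
    return [
        ssp
        for ssp, (nonspecific, specific, observed) in table.items()
        if abs(specific - observed) < abs(nonspecific - observed)
    ]


motif_starred = starred(MOTIF)
superfamily_starred = starred(SUPERFAMILY)
print("motif:", motif_starred)
print("superfamily:", superfamily_starred)
```

Under these values the rule marks EE (H-bonded), HH and HE for motifs, and EE (H-bonded), EH and HE for superfamilies, matching the asterisks in the table.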

Table 4. The distribution of S2 SSPs in four major SCOP classes.

| SCOP class | Number of superfamilies | Number of superfamilies containing EE | Number of superfamilies containing HH | Number of superfamilies containing EH | Number of superfamilies containing HE | Number of superfamilies containing split EE |
|---|---|---|---|---|---|---|
| all-α | 507 | 58 (0.11) | 477 (0.94) | 66 (0.13) | 64 (0.13) | 7 (0.01) |
| all-β | 354 | 347 (0.98) | 47 (0.13) | 150 (0.42) | 161 (0.45) | 254 (0.72) |
| α/β | 244 | 162 (0.66) | 174 (0.71) | 242 (0.99) | 241 (0.99) | 66 (0.27) |
| α+β | 552 | 541 (0.98) | 388 (0.70) | 508 (0.92) | 500 (0.91) | 138 (0.25) |

Table 5. Unobserved SSPs Statistics.

| SSP group | Number of SSPs with a jump | Number of SSPs with a crossing loop | Number of SSPs with a left-handed βxβ unit | Number of SSPs with both jump and crossing loop | Number of SSPs with both jump and left-handed βxβ unit | Number of SSPs with both crossing loop and left-handed βxβ unit | Number of SSPs with three uncommon structure features | Number of SSPs containing two or more uncommon structure features |
|---|---|---|---|---|---|---|---|---|
| Five-stranded β-sheet (8 SSPs) | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| β-sandwich with three β-strands in one sheet and two β-strands in another sheet (26 SSPs) | 22 | 26 | 26 | 22 | 22 | 26 | 13 | 26 |
| 4-stranded β-sheet flanked by an α-helix (9 SSPs) | 9 | 0 | 6 | 0 | 6 | 0 | 0 | 6 |
| Three-stranded β-sheet flanked by two α-helices on both sides (4 SSPs) | 4 | 3 | 2 | 3 | 2 | 1 | 1 | 4 |
| Three-stranded β-sheet flanked by two α-helices on one side (8 SSPs) | 5 | 8 | 7 | 5 | 4 | 7 | 4 | 8 |
| Two-stranded β-sheet flanked by three α-helices on one side (8 SSPs) | 0 | 8 | 5 | 0 | 0 | 5 | 0 | 5 |
| Total SSPs (63) | 48 (76%) | 45 (71%) | 46 (73%) | 30 (48%) | 34 (54%) | 39 (62%) | 18 (28%) | 49 (78%) |
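As a consistency check on Table 5, the per-group counts can be summed column by column and compared with the reported totals. The Python sketch below does this; the assignment of values to rows is our reading of the flattened table and should be treated as an assumption, and the short row labels are ours.

```python
# Column order: jump, crossing loop, left-handed βxβ unit, jump + crossing,
# jump + left-handed, crossing + left-handed, three features, two or more features.
ROWS = {
    "five-stranded β-sheet (8)":                  (8, 0, 0, 0, 0, 0, 0, 0),
    "β-sandwich, 3 + 2 strands (26)":             (22, 26, 26, 22, 22, 26, 13, 26),
    "4-stranded sheet + one helix (9)":           (9, 0, 6, 0, 6, 0, 0, 6),
    "3-stranded sheet, helix on each side (4)":   (4, 3, 2, 3, 2, 1, 1, 4),
    "3-stranded sheet, 2 helices, one side (8)":  (5, 8, 7, 5, 4, 7, 4, 8),
    "2-stranded sheet, 3 helices, one side (8)":  (0, 8, 5, 0, 0, 5, 0, 5),
}

# Totals row as reported in Table 5 (counts only).
REPORTED_TOTALS = (48, 45, 46, 30, 34, 39, 18, 49)

# Sum each column across the six SSP groups.
column_sums = tuple(sum(column) for column in zip(*ROWS.values()))

print(column_sums)
print(column_sums == REPORTED_TOTALS)
```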