Recurring sequence-structure motifs in (βα)8-barrel proteins and experimental optimization of a chimeric protein designed based on such motifs


Jichao Wang, Tongchuan Zhang, Ruicun Liu, Meilin Song, Juncheng Wang, Jiong Hong, Quan Chen, Haiyan Liu

PII: S1570-9639(16)30228-X
DOI: doi:10.1016/j.bbapap.2016.11.001
Reference: BBAPAP 39848
To appear in: BBA - Proteins and Proteomics
Received date: 1 August 2016
Revised date: 4 November 2016
Accepted date: 6 November 2016

Please cite this article as: Jichao Wang, Tongchuan Zhang, Ruicun Liu, Meilin Song, Juncheng Wang, Jiong Hong, Quan Chen, Haiyan Liu, Recurring sequence-structure motifs in (βα)8-barrel proteins and experimental optimization of a chimeric protein designed based on such motifs, BBA - Proteins and Proteomics (2016), doi:10.1016/j.bbapap.2016.11.001

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT

Recurring sequence-structure motifs in (βα)8-barrel proteins and experimental optimization of a chimeric protein designed based on such motifs

Jichao Wang a, Tongchuan Zhang a, Ruicun Liu a, Meilin Song a, Juncheng Wang a, Jiong Hong a, Quan Chen a,*, Haiyan Liu a,b,c,*

a School of Life Sciences, University of Science and Technology of China, Hefei, Anhui, 230027, China
b Hefei National Laboratory for Physical Sciences at the Microscale, Hefei, Anhui, 230027, China
c Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei, Anhui, 230031, China

Abstract

An interesting way of generating novel artificial proteins is to combine sequence motifs from natural proteins, mimicking the evolutionary path suggested by natural proteins comprising recurring motifs. We analyzed the βα and αβ modules of TIM barrel proteins by structure alignment-based sequence clustering. A number of preferred motifs were identified. A chimeric TIM was designed by using recurring elements as mutually compatible interfaces. The foldability of the designed TIM protein was then significantly improved by six rounds of directed evolution. The melting temperature was improved by more than 20 ℃. A variety of characteristics suggested that the resulting protein is well-folded. Our analysis provided a library of peptide motifs that is potentially useful for different protein engineering studies. The protein engineering strategy of using recurring motifs as interfaces to connect partial natural proteins may be applied to other protein folds.

Keywords: common secondary structure unit; chimeric protein design; directed evolution;


1. Introduction

Strategies to create artificial proteins with desired structures and/or functions are of sustained interest for the basic understanding of proteins and for protein engineering. De novo protein design holds great promise, and exciting progress has been made (Dahiyat and Mayo, 1997; Huang et al., 2016; Koga et al., 2012; Kuhlman et al., 2003; Xiong et al., 2014). However, the success rate of de novo design is currently still limited, with so far few examples of success for relatively large proteins (Li et al., 2013). On the other hand, analysis of the sequence and structure organization of natural proteins has pointed to alternative approaches to engineering artificial proteins (Blaber and Lee, 2012; Broom et al., 2012; Hocker et al., 2001; Nikkhah et al., 2006; Yadid and Tawfik, 2007). It has been suggested that larger proteins may contain smaller sequence/structure units that could be considered as core motifs, whose duplication, recombination or extension may lead to rapid generation of new proteins during evolution (Soding and Lupas, 2003; Tomii et al., 2012). In a number of studies, well-folded engineered proteins have been created with repetitive motifs (Broom et al., 2012; Nikkhah et al., 2006; Yadid and Tawfik, 2007) or by extending a core motif (Watanabe et al., 2014). More recently, Jacobs et al. proposed a computational algorithm to construct distinct protein structures by merging motifs of multiple secondary structure units from native proteins (Jacobs et al., 2016). Thus it is highly interesting to investigate recurring sequence-structure motifs in natural proteins, and to explore their applications in protein engineering besides their implications for protein function and evolution.

An interesting target for the systematic analysis of sequence and structural motifs is the (βα)8-barrel protein fold, also called the triosephosphate isomerase fold or TIM barrel fold. This fold is the most common one adopted by natural enzymes (Brändén, 1991). Despite their high structural similarity, members of the fold show extensive diversity in sequences and functions (Nagano et al., 2002). As an intact domain, the TIM barrel is formed by eight tandemly arranged (βα) units. Its modular organization at the subdomain level has attracted substantial attention because of implications for the evolutionary origin of new protein domains (Hocker et al., 2001; Lang et al., 2000; Richter et al., 2010) as well as for protein design (Eisenbeis et al., 2012; Fortenberry et al., 2011; Huang et al., 2016; Soberon et al., 2004). Lang et al. compared the atomic structures of two TIM-barrel proteins, HisA and HisF, which revealed that two-fold gene duplication and gene fusion from a common half-barrel ancestor led to the complete barrels of these enzymes (Lang et al., 2000). Hocker et al. went on to demonstrate that the half barrels in HisF form independent stable folding units (Hocker et al., 2001). Later on, Richter et al. suggested, based on computational and experimental evidence, that a (βα)2 quarter-barrel predecessor might have fused twice to yield the extant (βα)8 barrels (Richter et al., 2010). More recently, Farías-Rico et al. found high sequence and structure similarity between certain (βα)2 segments of TIM-barrel proteins and (βα)2 segments in flavodoxin-like fold domains (Farias-Rico et al., 2014). They also detected a family of sequences showing intermediate features between the two folds and determined the structure of one member of this family, which confirmed the cross-fold conservation of the structure of the (βα)2 unit (Farias-Rico et al., 2014).

A variety of protein design and engineering strategies have been reported and tested on TIM-barrel proteins, such as sequence consensus-based design (Sullivan et al., 2011) or catalytic migration (Saab-Rincon et al., 2012). Of particular relevance here are those engineering studies based on protein fragments, which could be viewed as inspired by the modular organization of natural TIM barrels. Notably, Eisenbeis et al. reconstituted an intact, well-folded artificial eight-unit barrel from fragments with different folds, one fragment from the TIM-barrel protein HisF and the other from the flavodoxin-like fold protein CheY (Eisenbeis et al., 2012). Fortenberry et al. reported the successful engineering of a perfectly symmetric variant of HisF (Fortenberry et al., 2011).

Insights from these previous studies suggest that it is worthwhile to systematically identify and compile recurring sequence and structure elements in TIM barrel proteins. In addition, it would also be interesting to explore new ideas that exploit these elements for protein engineering. In the current study, we applied a structure alignment-based sequence clustering approach to analyze such elements systematically. Most previous systematic analyses of TIM barrels focused on comparisons between full-length sequences and structures. Some interesting local motifs can be identified with this approach, such as the phosphate binding motif emphasized by Nagano et al. (Nagano et al., 2002). However, we expect that a systematic fragment-centric analysis may provide a more complete picture of the types and distributions of local motifs than the full-length-centric studies. We retrieved the αβ and βα fragments from a set of non-redundant TIM-barrel proteins as the basic fragments to be clustered. This fragmentation scheme allows elements of variable lengths to be extracted, differing from approaches that consider contiguous fragments of fixed lengths. In addition, several adjustments to the usual dynamic-programming-based structure alignment approach were applied to make sure that the alignments between fragments are always compatible with the alignments between complete barrels. After identifying the most frequent αβ and βα elements, longer elements were identified through their extensions.

Finally, we tested an engineering strategy of using recurring elements as conserved interfaces or recombination sites between partial domains of natural TIM barrels. This strategy is different from that adopted in previous engineering efforts, in which recurring elements were used mainly as components for recombination (Broom et al., 2012; Eisenbeis et al., 2012; Nikkhah et al., 2006; Yadid and Tawfik, 2007). Although an initial chimeric protein constructed with this strategy does not show ideal properties as a stable, well-folded monomeric globular domain, its folding properties can be greatly improved by an in vivo directed evolution approach optimizing protein stability and foldability (Foit and Bardwell, 2013; Foit et al., 2009). After six rounds of evolution, the foldability of the engineered protein was significantly improved according to various lines of experimental evidence; in particular, the melting temperature increased by more than 20 ℃.

2. Materials and Methods

2.1 Computational analysis of recurring motifs

(1) Defining basic fragments

Proteins of the (βα)8-barrel fold in the Structural Classification of Proteins (SCOP) database (Murzin et al., 1995) were collected and only one family member was selected for each protein family. The dataset was further purged to eliminate proteins of above 25% pair-wise sequence identity. This led to a final set containing 108 proteins. The 25% sequence identity requirement served to avoid possible residual redundancies in the dataset and affected the actual dataset size only minimally. The Protein Data Bank (PDB) (Sussman et al., 1998) IDs are given in Supplementary Table 1. Secondary structures in these proteins were initially assigned automatically with the STRIDE program (Heinig and Frishman, 2004). To avoid inaccurate assignments caused by structural variations, and especially to obtain a set of assignments in which the boundaries of the β strands forming the barrels were consistently defined in all proteins, a number of the automatically assigned boundaries of the secondary structure elements were manually adjusted, so that the starting residues of the β strands fall approximately into the same layer around the barrel. In addition, for the first residue of every β strand, its side chain should point toward the interior of the barrel, and its backbone carbonyl should form a hydrogen bond to a backbone amide of the previous β strand.

Based on the secondary structure assignments, two sets of super secondary structure fragments were extracted from the intact proteins. The first set comprised all the αβ fragments (referred to as HE fragments). The second comprised the βα fragments (referred to as EH fragments). Each fragment included two sequentially adjacent segments in regular secondary structures (helix or sheet) and the loop connecting them. In what follows, a fragment will be identified by its type ("HE" or "EH"), its source PDB ID and its sequential number (ranging from 1 to 8) in the barrel. We note that all non-terminal secondary structure segments are included twice in the fragment sets, once in the HE set and once in the EH set. In total 769 αβ fragments and 769 βα fragments were obtained.

(2) Discovering recurring motifs

Structure alignment. By usual standards, the fragments within each of the sets are structurally highly similar to each other. Here we examined whether they can be further divided into subsets according to their structural similarities. For this purpose, we carried out pair-wise structural alignments with the double dynamic programming algorithm (Toh, 1997). The following adjustments were made to the algorithm, so that the resulting alignments based on the local fragments are compatible with alignments based on the global (βα)8 barrel. Firstly, the same number of (best-aligned) residues from the helix and from the strand segments are used for structure superposition during double dynamic programming. This prevents the superposition from being dominated by the helix segments, which contain many more residues than the strand segments. Secondly, all main chain heavy atoms (N, Cα, C and O) are considered in the inter-position displacement calculations to make sure that the aligned residues have similar directions for hydrogen bond formation. The structural similarity between two fragments i and j was scored by the fraction of aligned positions relative to fragment lengths (equation 1).

(1)

Here $N_{ij}$ is the number of aligned residues, i.e., residues with root mean square deviations of main chain heavy atoms below 2.5 Å after structural superposition. The lengths of the fragments are denoted as $L_i$ and $L_j$, respectively. This score has a value between 0 and 1.

Sequence similarity. Based on their structure alignment, the sequence similarity between two fragments i and j is defined as the sum of the BLOSUM62 residue similarity scores ($B_{62}$) over the aligned positions, i.e.,

$$S^{\mathrm{seq}}_{ij} = \sum_{l} B_{62}\left(a^{i}_{l}, a^{j}_{l}\right) \qquad (2)$$

in which $a^{i}_{l}$ and $a^{j}_{l}$ represent the residue types at aligned position l in fragments i and j, respectively. On average, this similarity score increases with the alignment length. It thus reflects both structure similarity and sequence similarity. The score in equation (2) is scaled to give a nominal Z-score in equation (3),

$$Z_{ij} = \frac{S^{\mathrm{seq}}_{ij} - \mu}{\sigma} \qquad (3)$$

in which $\mu$ and $\sigma$ are the average and standard deviation of all possible pairwise $S^{\mathrm{seq}}_{ij}$ values.
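As a rough illustration of equations (2) and (3), the sketch below sums substitution scores over structurally aligned positions and converts the raw scores of a set of fragment pairs into nominal Z-scores. The small substitution table, the fragment names and the aligned residue pairs are invented for illustration and merely stand in for BLOSUM62 and for real alignments. The use of the population standard deviation is also an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, pstdev

# Toy substitution scores standing in for BLOSUM62 (illustrative values only).
TOY_SCORES = {
    ("G", "G"): 6, ("A", "A"): 4, ("D", "D"): 6,
    ("A", "G"): 0, ("A", "D"): -2, ("D", "G"): -1,
}

def pair_score(a, b, table=TOY_SCORES):
    """Symmetric lookup of the substitution score of two residue types."""
    key = (a, b) if (a, b) in table else (b, a)
    return table[key]

def sequence_similarity(aligned_pairs, table=TOY_SCORES):
    """Equation (2): sum of substitution scores over the aligned positions."""
    return sum(pair_score(a, b, table) for a, b in aligned_pairs)

def z_scores(raw):
    """Equation (3): scale raw pairwise scores by their mean and standard deviation."""
    values = list(raw.values())
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    return {pair: (score - mu) / sigma for pair, score in raw.items()}

# Hypothetical fragments f1..f3 with residue pairs at structurally aligned positions.
alignments = {
    ("f1", "f2"): [("G", "G"), ("A", "A"), ("D", "D")],
    ("f1", "f3"): [("G", "A"), ("A", "A")],
    ("f2", "f3"): [("D", "G"), ("A", "D")],
}
raw = {pair: sequence_similarity(pairs) for pair, pairs in alignments.items()}
z = z_scores(raw)
```

Because the raw score is a sum over aligned positions, longer alignments tend to score higher, which is why the score reflects structural as well as sequence similarity.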

The scores in (1) and (3) form continuous spectra. They were further regularized to make it easier to identify recurring motifs using modern clustering algorithms such as Affinity Propagation (Frey and Dueck, 2007). The scheme of similarity regularization was determined by trial and error, with many small variations on the scheme giving similar but not always exactly the same clustering results. The final scheme was chosen with a balanced consideration of cluster sizes, stability of clusters with respect to parameter perturbations, and eventually, manual inspection of the major clusters for structure/sequence consistency and for functional relevance (see results below). First, the structural similarity scores obtained by equation (1) are regularized to values of 0 and 1 based on cutoffs, namely,

$$\tilde{S}^{\mathrm{str}}_{ij} = \begin{cases} 1, & S^{\mathrm{str}}_{ij} \ge S_{\mathrm{cut}} \\ 0, & \text{otherwise} \end{cases} \qquad (4)$$

in which $S^{\mathrm{str}}_{ij}$ is the structural similarity score of equation (1). The cutoff scores $S_{\mathrm{cut}}$ have been chosen so that a quarter of the pairwise scores are transformed into 1. For the EH fragment set this cutoff is 0.54, while for the HE fragment set the cutoff is 0.67. At these cutoffs, the minimum numbers of aligned positions are 10 for the EH set and 8 for the HE set (Supplementary Figure 1). Second, the regularized $\tilde{S}^{\mathrm{str}}_{ij}$ was applied to mask the Z-score in equation (3) to give the final similarity score for clustering, namely,

$$C_{ij} = \tilde{S}^{\mathrm{str}}_{ij} \, Z_{ij} \qquad (5)$$
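The regularization and masking steps can be sketched as follows. This is a minimal illustration under stated assumptions: the cutoff is taken as the score at the top quarter of the sorted pairwise structural scores, masking is read as setting the Z-score of sub-cutoff pairs to zero, and all pair labels and numbers are made up.

```python
def quartile_cutoff(scores, fraction=0.25):
    """Cutoff such that roughly `fraction` of the scores are at or above it."""
    ordered = sorted(scores, reverse=True)
    n_keep = max(1, round(fraction * len(ordered)))
    return ordered[n_keep - 1]

def regularize(structural, cutoff):
    """Map each structural score to 1 (at or above the cutoff) or 0 (below it)."""
    return {pair: 1 if score >= cutoff else 0 for pair, score in structural.items()}

def masked_similarity(structural, z, cutoff):
    """Keep the Z-score where the regularized structural score is 1, else 0.0."""
    mask = regularize(structural, cutoff)
    return {pair: (z[pair] if mask[pair] else 0.0) for pair in z}

# Hypothetical pairwise structural scores and sequence Z-scores for eight pairs.
structural = {"p1": 0.9, "p2": 0.8, "p3": 0.7, "p4": 0.6,
              "p5": 0.5, "p6": 0.4, "p7": 0.3, "p8": 0.2}
z = {"p1": 1.2, "p2": -0.3, "p3": 2.0, "p4": 0.1,
     "p5": -0.5, "p6": 0.4, "p7": 0.0, "p8": -1.1}

cutoff = quartile_cutoff(structural.values())
final = masked_similarity(structural, z, cutoff)
```

In this toy case only the pairs `p1` and `p2` pass the cutoff, so a pair such as `p3` with a high sequence Z-score but insufficient structural overlap contributes nothing to the clustering input.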

Clustering. Affinity Propagation clustering was applied separately to the HE and the EH sets with the above similarity matrix C as input. Relatively large clusters were inspected for pairwise similarities within clusters. It was found that a cluster would occasionally contain a member that has a high similarity score only to the representative member of the cluster assigned by the algorithm, but not to other members of the cluster, while the remaining members showed high mutual similarities. More detailed inspection indicated that this was mainly caused by accidental sequence similarity (identity) of the concerned fragment with the cluster representative at positions not conserved in the cluster as a whole. As the similarity scores have been evaluated with a small number of aligned positions, such accidental sequence identity may lead to high scores between unrelated fragments. To eliminate the effects of such misclassification, we removed cluster members which were close neighbors of less than half of the other members of the cluster. Here by close neighbors we mean fragment pairs with a masked Z-score (equation (5)) of above 1.0. Then a sequence profile was generated for the aligned positions of each cluster, with all cluster members aligned to the cluster representative. From the profile, highly conserved sequence positions, at which more than half of the cluster members have the same residue type, were identified. For clusters with 4 or more highly conserved positions, member fragments having the respective conserved residue types at fewer than 2 conserved positions were further removed from the cluster.
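The two pruning rules described above can be sketched as follows. The member names, scores and aligned sequences are hypothetical, and evaluating the neighbor rule in a single pass against the original member list is an assumption, since the text does not say whether the removal was iterated.

```python
from collections import Counter

def prune_by_neighbors(members, score, threshold=1.0):
    """Keep members that are close neighbors of at least half of the other members."""
    kept = []
    for m in members:
        others = [o for o in members if o != m]
        close = sum(1 for o in others
                    if score.get(frozenset((m, o)), 0.0) > threshold)
        if close * 2 >= len(others):
            kept.append(m)
    return kept

def conserved_positions(aligned):
    """Positions where more than half of the members share one residue type."""
    n = len(aligned)
    length = len(next(iter(aligned.values())))
    conserved = {}
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned.values() if seq[pos] != "-")
        if counts:
            residue, count = counts.most_common(1)[0]
            if count * 2 > n:
                conserved[pos] = residue
    return conserved

def prune_by_conservation(aligned, min_conserved=4, min_matches=2):
    """In well-conserved clusters, drop members matching fewer than two conserved positions."""
    conserved = conserved_positions(aligned)
    if len(conserved) < min_conserved:
        return dict(aligned)
    return {m: seq for m, seq in aligned.items()
            if sum(seq[p] == r for p, r in conserved.items()) >= min_matches}

# Hypothetical cluster: "x" scores highly only against the representative "a".
score = {frozenset(("a", "b")): 2.0, frozenset(("a", "c")): 1.8,
         frozenset(("b", "c")): 1.5, frozenset(("a", "x")): 1.4}
kept = prune_by_neighbors(["a", "b", "c", "x"], score)

# Hypothetical members aligned to the representative (equal-length strings).
aligned = {"a": "GADGL", "b": "GADGV", "c": "GADGI", "d": "KSEPL"}
conserved = conserved_positions(aligned)
pruned = prune_by_conservation(aligned)
```

The first rule removes a member that resembles only the cluster representative, while the second removes a member that lacks the residue types shared by most of the cluster.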

With the above protocol, we expected to single out the basic fragments that are most certainly recurring. Longer recurring fragments containing more than one basic fragment were then found by looking at recurring patterns of consecutive fragment types in the 108 proteins.
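The search for longer recurring fragments can be pictured as a scan for runs of consecutive cluster labels that appear in more than one protein. The sketch below is only an illustration: the protein names and label lists are hypothetical, unclustered fragments are marked with `None`, and the threshold of two proteins is the one quoted for the listed combinations in the Results.

```python
from collections import defaultdict

def recurring_combinations(assignments, length=2, min_proteins=2):
    """Find runs of `length` consecutive cluster labels shared by several proteins.

    assignments maps a protein ID to the cluster labels of its basic fragments
    in chain order, with None for fragments that belong to no cluster.
    """
    seen = defaultdict(set)
    for protein, labels in assignments.items():
        for start in range(len(labels) - length + 1):
            window = tuple(labels[start:start + length])
            if None not in window:
                seen[window].add(protein)
    return {combo: sorted(proteins) for combo, proteins in seen.items()
            if len(proteins) >= min_proteins}

# Hypothetical assignments for three proteins.
assignments = {
    "P1": ["HE66", "EH64", None, "EH10"],
    "P2": [None, "HE66", "EH64", "HE5"],
    "P3": ["HE5", "EH10", None, None],
}
combos = recurring_combinations(assignments)
```

Here only the combination of `HE66` followed by `EH64` is reported, because it is the only consecutive pair that occurs in two different proteins.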

2.2. Experimental studies on a designed chimeric protein

(1) Design of the chimeric protein

We experimented on the idea of engineering artificial proteins using the recurring motifs not as building blocks, but as interfaces, to integrate natural partial domains. Two proteins (PDB: 1YXY and 1THF, respectively) were selected as donors of partial domains. 1YXY is a TIM-fold protein whose structure was deposited in the PDB without any associated publication so far, and its structural stability has not been reported. 1THF is another TIM-fold protein, from the hyperthermophilic species Thermotoga maritima. Thus it is expected to be of strong structural stability. In fact, several successful protein recombination studies have been reported using this protein as a fragment donor (Akanuma and Yamagishi, 2008; Bharat et al., 2008; Eisenbeis et al., 2012; Hocker, 2014; Hocker et al., 2004; Shanmugaratnam et al., 2012). According to the computational analysis, the 3rd and the 7th (αβα) units of these two proteins fall into the same or closely related structure-sequence motifs. Figure 1 illustrates the design of the chimeric protein CBA0 (chimeric barrel A0, or simply A0). Based on the locations of the basic fragments of shared (or highly similar) types in these two proteins, the sequence of A0 was designed to be composed of three segments: the first segment, from the N terminus to the middle of the 3rd unit, taken from 1YXY; the second segment, from the middle of the 3rd unit to the beginning of the 7th unit, taken from 1THF; and the last segment, from the 7th unit to the C terminus, again taken from 1YXY. A few point mutations from the parent sequences were introduced: L162A and K196A were introduced to avoid clashes. Moreover, the two parental TIM barrels have somewhat different hydrophobicity at the core of the respective barrels. In 1YXY, the core of the barrel is exclusively hydrophobic, while in 1THF, the bottom of the barrel (the N-terminal ends of the β strands) is sealed with a ring of polar residues, with the rest of the interior of the barrel being mostly hydrophobic. Thus further point mutations, I52V, K92V and E160F, were introduced, all within the 1THF half and corresponding to residues forming the bottom ring of the barrel. These mutations replaced the residue types of 1THF with those in 1YXY. As a result, the chimeric barrel has a purely hydrophobic interior. The gene coding for A0 was synthesized and the protein prepared by recombinant expression (see experimental details) for further characterization.
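The assembly of a three-segment chimera followed by point mutations can be sketched as below. This is a toy illustration only: the two parent sequences are invented ten-residue strings assumed to be aligned position by position without gaps, and the crossover positions and the mutation `L5A` are hypothetical, not those of 1YXY, 1THF or A0.

```python
import re

def build_chimera(host, donor, start, end):
    """Replace host positions start..end (1-based, inclusive) with the donor segment.

    Assumes the two parents are already aligned without gaps and of equal length.
    """
    if len(host) != len(donor):
        raise ValueError("parents must be aligned to the same length")
    return host[:start - 1] + donor[start - 1:end] + host[end:]

def apply_mutations(sequence, mutations):
    """Apply point mutations written like 'L162A' (old residue, 1-based position, new residue)."""
    residues = list(sequence)
    for mutation in mutations:
        match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
        if match is None:
            raise ValueError(f"cannot parse mutation {mutation!r}")
        old, position, new = match.group(1), int(match.group(2)), match.group(3)
        if residues[position - 1] != old:
            raise ValueError(f"{mutation}: found {residues[position - 1]} at position {position}")
        residues[position - 1] = new
    return "".join(residues)

# Invented parent sequences; the middle segment (positions 4-7) comes from the donor.
host = "MKLAVGHTEW"
donor = "MRIGLDNSQY"
chimera = build_chimera(host, donor, 4, 7)
mutant = apply_mutations(chimera, ["L5A"])
```

Checking the expected old residue before substituting catches numbering mismatches between the chimera and its parents, which are easy to introduce when the parents differ in length.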


Fig.1 Design of the chimeric protein A0 model. The protein 1THF is shown in blue and 1YXY in red. The (αβα)3 and (αβα)7 modules are shown in cartoon representation and aligned with TM-align (Zhang and Skolnick, 2005). The RMSD between the (αβα)3 modules of the two proteins is 1.64 Å (left), and that between the (αβα)7 modules is 1.39 Å (middle). A new (βα)8-barrel protein model was obtained by combining half barrels cut from the similar (αβα) modules, replacing the (βα)3_7 part of 1YXY by the (βα)3_7 part of 1THF (right). The residues shown as green sticks were mutated: L162A and K196A for avoiding clashes, and I59V, K92V and E160F for the hydrophobic unification of the new protein.

(2) Directed evolution of the chimeric protein

Variants of A0 were obtained through directed evolution aimed at improving protein structural stability. The in vivo directed evolution system of Bardwell and coworkers (Foit and Bardwell, 2013; Foit et al., 2009) was used. In this system, a protein of interest (POI, here A0 or a random mutant of A0) is inserted into the TEM β-lactamase at a specific position to form a fusion protein, which is expressed in host bacterial cells. As an unstable POI would result in increased proteolysis of the fusion protein, which would in turn lead to reduced or defective cellular β-lactamase activity, the stability of the POI is linked to the ampicillin resistance of the host cells. This system was employed to improve the foldability of A0 through directed evolution. In each round of directed evolution, a library was generated with error-prone PCR (Cadwell and Joyce, 1994; Cirino et al., 2003) from one or a few templates. The cells containing the library were selected for clones that survived an ampicillin concentration that prohibits the growth of bacteria expressing the respective templates. Templates for each round were proteins selected from the previous round. The process was ended when no further increase in antibiotic resistance could be achieved in several repeated attempts.

(3) Characterization of the chimeric protein and its variants

The chimeric protein and several variants were expressed, purified and characterized for a number of properties. General structural integrity was checked by proteinase K resistance. Thermal stability was measured with differential scanning microcalorimetry (DSC) as well as differential scanning fluorimetry (DSF). Chemical-induced denaturation by guanidine hydrochloride was monitored by circular dichroism spectroscopy and fluorescence spectroscopy.

(4) Experimental details

Directed evolution. The templates and the mutant libraries were constructed in the LFM10 vector (Foit and Bardwell, 2013; Foit et al., 2009) by ligating PCR products at the BamHI and XhoI sites. The templates for error-prone PCR were first amplified by PCR using PrimeSTAR (TaKaRa). For the first and second rounds of directed evolution, the primers for both steps were caggatccatgaaaccgacgaaagaaaaactg and aactcgagcctttcagagcttcaataaag. In the third to the sixth rounds, the primers for error-prone PCR were the adaptor primer sequences tgccacctgacgtctaagaa and attaccgcctttgagtgagc. To introduce the needed adaptor sequences, amplification by PrimeSTAR with the extended primers tgccacctgacgtctaagaaggatccatgaaaccgacgaaagaaaaactg and attaccgcctttgagtgagcctcgagcctttcagagcttcaataaag was used. The ligation reaction product was transformed into NEB 10-beta cells (New England Biolabs). The increased antibiotic resistance of selected colonies was confirmed by reconstruction into the original LFM10 vector, retransformation into new NEB 10-beta cells, and an antibiotic resistance assay with serial dilution (to exclude effects of mutations in the plasmid backbone and the bacterial genome).
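The relationship between the primers quoted above can be checked with a few lines of code. The sketch below locates the BamHI (GGATCC) and XhoI (CTCGAG) recognition sequences in each primer and confirms that each extended primer is the adaptor followed by the gene-specific primer without its first two bases. The primer strings are copied from the text, while the variable and function names are illustrative.

```python
# Primer sequences as given in the text (5'->3', lower case).
FWD = "caggatccatgaaaccgacgaaagaaaaactg"
REV = "aactcgagcctttcagagcttcaataaag"
ADAPTOR_FWD = "tgccacctgacgtctaagaa"
ADAPTOR_REV = "attaccgcctttgagtgagc"
EXT_FWD = "tgccacctgacgtctaagaaggatccatgaaaccgacgaaagaaaaactg"
EXT_REV = "attaccgcctttgagtgagcctcgagcctttcagagcttcaataaag"

# Recognition sequences of the two restriction enzymes used for ligation.
SITES = {"BamHI": "ggatcc", "XhoI": "ctcgag"}

def find_sites(primer):
    """Return the 0-based start index of each recognition sequence present in the primer."""
    return {name: primer.find(site) for name, site in SITES.items() if site in primer}

def strip_adaptor(extended, adaptor):
    """Return the part of an extended primer that follows its adaptor sequence."""
    if not extended.startswith(adaptor):
        raise ValueError("extended primer does not start with the adaptor")
    return extended[len(adaptor):]

fwd_sites = find_sites(FWD)
rev_sites = find_sites(REV)
ext_fwd_tail = strip_adaptor(EXT_FWD, ADAPTOR_FWD)
ext_rev_tail = strip_adaptor(EXT_REV, ADAPTOR_REV)
```

In both extended primers the restriction site directly follows the 20-nucleotide adaptor, so the adaptor-based primers of the later rounds amplify the insert together with its cloning sites.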

Protein expression and purification. The DNA encoding the initial designed protein A0 and that encoding its variant AR703 from the final round of directed evolution were separately inserted into a modified pET-22b(+) vector. They were expressed as His-tag fusion proteins in Escherichia coli BL21(DE3) by induction with 1 mM IPTG for 22 h at 16 ℃ (A0) or for 4 h at 37 ℃ (AR703). The proteins were purified by Ni2+ affinity column chromatography. After purification, dithiothreitol (DTT, final concentration 10 mM) or a sufficient amount of H2O2 (about 20 times the amount of protein in molar ratio) was added, and the sample was incubated on ice for one hour to reduce or oxidize the cysteine residues. Gel-filtration chromatography with a Superdex 200 16/60L column or a Superdex 200 10/300 GL column (GE Healthcare) was used for further purification or analysis. The proteins for further physicochemical characterization were kept in buffer containing 20 mM Tris-HCl, 300 mM NaCl and 1 mM EDTA (pH 8.0) unless otherwise indicated. When necessary, DTT was added (final concentration 2-6 mM) to the buffer to keep the protein in the reduced state for subsequent studies. The purified proteins were confirmed by SDS-PAGE and the concentrations were determined by the absorbance at 280 nm.
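Determining concentration from the absorbance at 280 nm rests on the Beer-Lambert law, c = A / (ε·l). The sketch below shows the arithmetic. The extinction coefficient and molecular weight are placeholder values chosen for illustration, not the values of A0 or AR703, which are not given in the text.

```python
def concentration_from_a280(a280, ext_coeff_per_m_cm, path_cm=1.0):
    """Beer-Lambert law: molar concentration = absorbance / (extinction coefficient * path length)."""
    if ext_coeff_per_m_cm <= 0 or path_cm <= 0:
        raise ValueError("extinction coefficient and path length must be positive")
    return a280 / (ext_coeff_per_m_cm * path_cm)

def molar_to_mg_per_ml(conc_molar, mol_weight_da):
    """Convert mol/L to mg/mL (mol/L * g/mol = g/L, which equals mg/mL)."""
    return conc_molar * mol_weight_da

# Placeholder values for illustration only.
EXT_COEFF = 25000.0   # M^-1 cm^-1, hypothetical
MOL_WEIGHT = 28000.0  # Da, hypothetical

molar = concentration_from_a280(0.5, EXT_COEFF)
mg_per_ml = molar_to_mg_per_ml(molar, MOL_WEIGHT)
```

With these placeholder numbers an absorbance of 0.5 in a 1 cm cuvette corresponds to 20 μM, or 0.56 mg/ml.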

Protease resistance assay. A protein sample of 1 mg/ml was incubated with 1/200000 (v/v) proteinase K (TaKaRa, Code No. 9034) for 5 min at 37 ℃. The same reaction without proteinase K was carried out as a reference. Digestion was terminated by immediately adding SDS-loading buffer and heating the sample at 100 ℃ for 10 min. SDS-PAGE was carried out to check the resistance to proteinase K.

Thermostability measurements. DSC experiments were performed on a VP-DSC Microcalorimeter (MicroCal Inc.). Protein at 3 mg/ml was subjected to a heating and cooling ramp from 10 ℃ to 85 ℃ at a scanning rate of 1 ℃/min. Origin lab software (MicroCal Inc.) was used to analyze the raw data to obtain a heat capacity profile. DSF measurements were performed on a Roche LightCycler480 real-time PCR system with excitation at 455 nm and emission at 580 nm. DSF reaction mixtures composed of 16 μM protein and 5× SYPRO Orange dye (Sigma-Aldrich) were heated from 25 ℃ to 80 ℃ at a rate of 2.4 ℃/min. The melting temperature was calculated with the built-in software.
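The text does not say how the instrument software derives the melting temperature. One common convention, assumed here purely for illustration, is to take the temperature at which the fluorescence rises most steeply, i.e., the maximum of the first derivative of the melt curve. The sketch below applies this to a synthetic sigmoidal curve; the data and the function name are invented.

```python
import math

def melting_temperature(temps, signal):
    """Temperature at the steepest rise of the melt curve (central differences)."""
    if len(temps) != len(signal) or len(temps) < 3:
        raise ValueError("need at least three paired data points")
    best_temp, best_slope = None, float("-inf")
    for i in range(1, len(temps) - 1):
        slope = (signal[i + 1] - signal[i - 1]) / (temps[i + 1] - temps[i - 1])
        if slope > best_slope:
            best_temp, best_slope = temps[i], slope
    return best_temp

# Synthetic two-state melt curve with its midpoint at 55 degrees (illustrative data).
temps = [float(t) for t in range(25, 81)]
signal = [1.0 / (1.0 + math.exp(-(t - 55.0) / 2.0)) for t in temps]
tm = melting_temperature(temps, signal)
```

Real DSF traces usually show a fluorescence decrease after the transition and some noise, so smoothing or curve fitting is typically applied before the derivative is taken.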

Chemical-induced denaturation. The circular dichroism spectra were obtained on a Jasco J-810 spectropolarimeter using a 1 mm path length quartz cuvette at room temperature. Measurements were carried out using 0.2 mg/ml protein samples in 20 mM phosphate buffer (pH 7.8) and each spectrum was recorded in the wavelength range of 260-200 nm. The data at λ=222 nm were followed. The intrinsic fluorescence emission spectra were measured on a Shimadzu RF-5301PC spectrofluorophotometer, with an excitation wavelength of 280 nm over a wavelength range of 300-400 nm. Increasing concentrations of guanidine hydrochloride were added to 0.4 mg/ml protein solution and equilibrated overnight at 4 ℃. The fluorescence at 334 nm was collected.

3. Results and Discussion

3.1. The existence of recurring motifs in (βα)8 barrels.

Clustering of the basic fragments led to 82 clusters that covered 397 HE fragments and 80 clusters that covered 379 EH fragments, with each cluster containing at least three members. For the HE fragments and the EH fragments, Supplementary Tables 2 and 3, respectively, list for each cluster (referred to by a unique numeric ID) its size (i.e., the number of basic fragments forming the cluster), the averaged similarity score between cluster members, the number of conserved sequence positions across cluster members, and all member fragments. A number of properties associated with the clustering results strongly support the existence of recurring HE and EH motifs in (βα)8 barrel proteins.


Fig.2 (a) Cluster size distributions. Green: HE fragments, Yellow: EH fragments. (b) Distributions of the number of conserved positions in individual clusters. Green: HE fragments, Yellow: EH fragments. (c) The intra-cluster and inter-cluster sequence Z-score distributions. Yellow: Inter cluster sequence Z-score distribution of fragments in different clusters. Dark Green: Intra cluster sequence z-score distribution of fragments in the same clusters. Light Green: when two fragments are in the same cluster, other fragments in the two proteins are compared to determine the Z-score. (See the main text for details.)

First, the larger the clusters, the less likely it is that the intra-cluster similarities have appeared by chance. For both the HE and the EH types of fragments, more than 50% of the clusters comprised five or more members (Fig.2a).

Second, more than 90% (80%) of the clusters contain at least one (two) conserved sequence positions (Fig.2b). The distributions in Figure 2b appear to be multi-modal, suggesting that while there is a group of clusters that have few (fewer than five, mostly two or three) conserved positions, there is also a substantial number of clusters that have five or more conserved positions. We note that these conserved positions have been derived based on the alignments of structures, not sequences. Thus the probability of observing conserved residue types at aligned positions by chance should be small.

Third, for most clusters, the mutually similar member fragments in one cluster are mostly from (βα)8 barrel proteins whose remaining parts are not more similar than fragments from proteins that do not contain any similar fragments. To demonstrate this, for every pair of (βα)8 barrel proteins that have at least one basic fragment assigned to the same cluster, we analyzed the similarity between the remaining fragments of the two proteins. In this analysis, only the fragments separated by the same number of βα units from the shared fragments were paired for comparison. The resulting distribution of pairwise similarity scores is almost the same as that of the scores of unrelated fragment pairs, and is significantly shifted towards smaller scores as compared with the distribution of intra-cluster similarity scores (Fig.2c).

The distributions of intra-cluster and inter-cluster similarity Z-scores are not strictly separable (Fig.2c). Given that the clustering scheme is not perfect, there is the possibility that similar fragments in separate clusters are actually related. Thus in the footnotes of Supplementary Tables 2 and 3, we list pairs of clusters that are associated with averaged inter-cluster similarity Z-scores of above 1.5. For the HE clusters, there are 18 such pairs, most of them associated with a network of clusters that are relatively densely connected with medium mutual similarity, the network involving clusters 9, 11, 36, 38, 58, 63, and 102. For the EH clusters, only two such pairs were found.
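Identifying related cluster pairs amounts to averaging the pairwise scores between the members of two clusters and keeping the pairs whose average exceeds 1.5. A minimal sketch follows, with hypothetical cluster contents and scores, and with the assumption that fragment pairs lacking a score count as zero.

```python
from itertools import combinations

def average_inter_cluster(clusters, score):
    """Average pairwise score between the members of every pair of clusters."""
    averages = {}
    for (id_a, members_a), (id_b, members_b) in combinations(sorted(clusters.items()), 2):
        values = [score.get(frozenset((a, b)), 0.0)  # assumption: missing pairs score 0
                  for a in members_a for b in members_b]
        averages[(id_a, id_b)] = sum(values) / len(values)
    return averages

def related_pairs(clusters, score, threshold=1.5):
    """Cluster pairs whose averaged inter-cluster score is above the threshold."""
    averages = average_inter_cluster(clusters, score)
    return sorted(pair for pair, avg in averages.items() if avg > threshold)

# Hypothetical clusters (numeric IDs) and pairwise Z-scores between their members.
clusters = {9: ["a", "b"], 11: ["c"], 36: ["d"]}
score = {frozenset(("a", "c")): 2.0, frozenset(("b", "c")): 1.4,
         frozenset(("a", "d")): 0.5, frozenset(("c", "d")): 1.0}
pairs = related_pairs(clusters, score)
```

Treating each reported pair as an edge between two clusters gives the kind of network described above, in which several clusters are linked by medium mutual similarity.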

To find recurring motifs containing multiple basic fragments, we analyzed clusters of basic HE and EH motifs that co-occurred at consecutive positions in different (βα)8 barrels. Cluster combinations that co-occurred in at least two proteins have been listed, together with the locations of the respective basic fragments in the containing proteins (Supplementary Table 4). Compared with the number of clustered basic fragments, there are relatively few recurrent extended motifs of multiple basic fragments. This indicates that combinations between the basic αβ and βα units can vary a lot, and that there may be only a few specific combinations of basic fragments that are particularly favored. One of the most frequent extended motifs is the αβα motif composed of HE66-EH64; the βα loop in this motif corresponds to the well-known phosphate binding motif in (βα)8 barrels (see below). Besides this motif, many of the recurrent extended motifs (Supplementary Table 4) contain basic motifs that are associated with the coordination of metal ions, including motifs EH54, EH97 and EH72 (see below). This is probably because metal coordination requires cooperative interactions involving residues on multiple αβ or βα units.

3.2. Conserved residues in recurring HE motifs mainly play structural roles.

The conserved positions and respective conserved residue types within the clusters were inspected in detail. For the clusters that show strong sequence conservation, important sequence-structure or sequence-function relationships could be revealed.
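The search for extended motifs described at the end of Section 3.1 (cluster combinations that co-occur at consecutive positions in at least two proteins) can be sketched as follows. This is only an illustration under stated assumptions: each protein is represented as an ordered list of cluster labels, `None` marks a basic fragment that was not assigned to any cluster, and the protein names as well as the labels HE10 and EH12 are hypothetical.

```python
from collections import defaultdict


def recurrent_combinations(protein_motifs, min_proteins=2):
    """Find consecutive cluster pairs that occur in at least min_proteins proteins.

    protein_motifs: dict mapping a protein id to its cluster labels, ordered
    along the chain (None where a fragment has no cluster assignment).
    Returns a dict mapping (label, next_label) to sorted (protein, position) sites.
    """
    occurrences = defaultdict(set)
    for protein, labels in protein_motifs.items():
        for position, pair in enumerate(zip(labels, labels[1:])):
            if None not in pair:
                occurrences[pair].add((protein, position))
    return {
        combo: sorted(sites)
        for combo, sites in occurrences.items()
        if len({protein for protein, _ in sites}) >= min_proteins
    }


# Hypothetical toy input; only the HE66-EH64 combination recurs.
toy_proteins = {
    "protA": ["HE66", "EH64", "HE10"],
    "protB": [None, "HE66", "EH64"],
    "protC": ["HE66", "EH12", "HE10"],
}
recurrent = recurrent_combinations(toy_proteins)
```

Recording the position of each occurrence alongside the protein mirrors the way the locations of the basic fragments are reported together with the combinations.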


Fig.3 Examples of HE clusters. The sequence logos are shown left and their superimposed structures are shown right. The cluster members can be found in Supplementary Table 2. The conserved sites discussed in the main text are shown in red with their sidechains displayed.

As examples, the sequence logos and superimposed structures of four HE clusters that have the highest intra-cluster sequence similarity Z scores are given in Fig.3. Each of these clusters contains at least five members. For three of them, namely, clusters HE102, HE36 and HE63, the conserved positions are concentrated in the αβ loop region. All three clusters have a conserved glycine at the beginning of the loop, and exhibit a two-residue “GA” pattern comprising this glycine and an alanine next to it. Because of its small side chain, the glycine may signal the ending of the α helix, promoting a sharp turn of the peptide backbone at this position. Alanine, the most common residue type at the next position, also has a small side chain. In clusters HE102 and HE63, the position right after the “GA” pattern is also highly conserved as an aspartate. As the side chain of this aspartate invariably forms a hydrogen bond with the N-terminal backbone of a neighboring β-strand, it may be important for the stability of the overall (βα)8 barrels, as suggested and experimentally tested previously (Nagano et al., 2002). In cluster HE102, the “GAD” sequence pattern is further extended by another conserved glycine. Cluster HE36 differs from clusters HE102 and HE63 in that it presents only the “GA” but not the “GAD” or “GADG” sequence patterns in the αβ loop. However, at the position one residue away from the “GA” pattern, cluster HE36 shows strong preferences for residues with large aromatic side chains.
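Sequence logos such as those in Fig.3 summarize per-position conservation. As a rough illustration of how such conservation can be quantified, the sketch below computes the standard per-position information content (log2 of the alphabet size minus the Shannon entropy) for a set of aligned fragments. It assumes pre-aligned fragments of equal length and applies no small-sample correction; the toy fragments are hypothetical and are not members of any cluster in this work, and this is not necessarily the procedure used to draw the figures.

```python
from collections import Counter
from math import log2


def information_content(aligned_fragments, alphabet_size=20):
    """Per-position information content in bits for equal-length fragments."""
    length = len(aligned_fragments[0])
    bits = []
    for i in range(length):
        counts = Counter(fragment[i] for fragment in aligned_fragments)
        total = sum(counts.values())
        # Shannon entropy of the residue distribution at this position.
        entropy = -sum((n / total) * log2(n / total) for n in counts.values())
        bits.append(log2(alphabet_size) - entropy)
    return bits


# Hypothetical aligned loop fragments showing a "GA" pattern at positions 1-2.
toy_fragments = ["LGADG", "VGADG", "IGAEG", "LGSDG"]
bits = information_content(toy_fragments)
```

A fully conserved position, such as the glycine at index 1 in the toy data, reaches the maximum of log2(20) bits, while variable positions score lower.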

The other example, HE74, exhibits a different pattern of sequence conservation (Fig.3). The peptide leaves the αβ loop and enters the β strand with a three-residue “IPV” sequence motif. In addition, the large hydrophobic side chains of the conserved isoleucine and valine in this motif form a buried hydrophobic cluster with conserved leucine/valine sidechains from the N-terminus of the α helix. We note that HE74 has a relatively wide spread of structures among its members.

The above examples suggest that the conserved residues in recurring αβ motifs mainly play structural roles. This is consistent with the observation that inside the (βα)8 barrels, the active sites are usually located near the C-terminus of the inner β-strand barrel, that is, close to the βα loops but not the αβ loops.

3.3. Conserved residues in recurring EH motifs may play important functional roles.

There are 45 EH clusters in total that each have at least four members, three conserved positions and an intra-cluster similarity Z-score above 1.5. Manual inspection of these clusters suggested that the positions and types of the conserved residues in these EH clusters are much more diverse than those in the HE clusters. In different clusters, conserved residues of varied amino acid types are found in different regions within the basic β strand-βα loop-α helix framework, while for a particular cluster the conserved positions can be concentrated in one or two regions. In what follows, we focus on EH clusters that have conserved residues with polar sidechains in the C-terminal region of the β strand or in the nearby βα loop. Compared with other clusters, these clusters or motifs are more likely to be associated with functions, because the active sites of intact (βα)8 barrel proteins are usually formed by residues in this region. In addition, the conserved residues forming an active site usually include those with polar side chains. The sequence logos and superimposed structures for seven such clusters are shown (Fig.4). In Supplementary Figure 2, we show as examples the sequence logos and superimposed structures of two clusters (clusters 2 and 93) whose conserved residues are either not located in the designated region or not of polar sidechain types. Such motifs may play important structural roles in (βα)8 barrel proteins, while it is difficult to discuss their functional roles based on their sequence conservation patterns.
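The selection criteria quoted at the start of this section (at least four members, three conserved positions, intra-cluster similarity Z-score above 1.5) amount to a simple filter. The sketch below is an illustration only: it reads "three conserved positions" as "at least three", and the `Cluster` record, its field names and the toy entries are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    n_members: int
    n_conserved_positions: int
    z_score: float


def passes_filter(cluster, min_members=4, min_conserved=3, min_z=1.5):
    """Apply the three thresholds; the Z-score bound is strict ("above 1.5")."""
    return (
        cluster.n_members >= min_members
        and cluster.n_conserved_positions >= min_conserved
        and cluster.z_score > min_z
    )


# Hypothetical clusters, not taken from Supplementary Table 3.
toy_clusters = [
    Cluster("EH_a", 6, 4, 2.3),
    Cluster("EH_b", 3, 5, 2.8),  # rejected: too few members
    Cluster("EH_c", 5, 3, 1.2),  # rejected: Z-score not above 1.5
]
selected = [c.name for c in toy_clusters if passes_filter(c)]
```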

Fig.4 Examples of EH clusters. The sequence logos are shown left and their superimposed structures are shown right. The cluster members can be found in Supplementary Table 3. The conserved sites discussed in the main text are shown in red with their sidechains displayed.

As shown in Fig. 4, cluster EH64 corresponds to the well-known phosphate binding motif. The sequence logo of this cluster shows that, to varied extents, sequence conservation is exhibited throughout the entire fragment, with the most conserved residues being at the N- and C-terminal ends of the β strand. Cluster EH54 exhibits strong sequence conservation in the β strand. An invariant Asp can be found at the C-terminal end of the β strand. Inspection of the PDB structures of proteins containing the member fragments of EH54 revealed that this Asp takes part in the coordination of a divalent metal ion, such as iron, nickel or zinc.

Cluster EH97 has a conserved DXHXH (the ‘X’ represents variable residue types) sequence motif that spans the β strand. In the corresponding proteins, the two histidine residues in this motif participate in divalent metal (namely, zinc or nickel) coordination, while the sidechain of the N-terminal Asp residue seems to play a structural role by forming a hydrogen bond with the backbone of a neighboring strand. Another conserved feature of fragments in this cluster is the glycine at the C-terminal end of the α-helix.
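Sequence patterns written in the notation used here, such as DXHXH or the “GA” pattern discussed earlier, can be located in a sequence with a regular expression. The sketch below is a minimal illustration that treats ‘X’ as any uppercase one-letter residue code; the helper names and the toy sequence are hypothetical, and a sequence match alone does not imply that a fragment belongs to a given structural cluster.

```python
import re


def motif_to_regex(motif):
    """Translate a pattern such as 'DXHXH' into a regex; 'X' matches any residue."""
    return re.compile("".join("[A-Z]" if ch == "X" else ch for ch in motif))


def find_motif(sequence, motif):
    """Return the 0-based start index of every (possibly overlapping) match."""
    pattern = motif_to_regex(motif).pattern
    # A lookahead keeps the match zero-width so overlapping hits are reported.
    return [m.start() for m in re.finditer(f"(?=({pattern}))", sequence)]


# Hypothetical toy sequence containing one DXHXH occurrence starting at index 3.
hits = find_motif("MKVDAHLHGG", "DXHXH")
```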

Cluster EH72 has a conserved DXXH motif. The βα loop has the form of a short helix in this cluster. The conserved Asp is at the C-terminal end of the β strand while the conserved His is at the beginning of the short helix. Again, these two residues take part in divalent metal ion (zinc or magnesium) binding in the containing proteins.

Most of the conserved positions for clusters EH35, EH79 and EH15 are located on the α helix, especially at the C-terminal end of the helix. Such locations and the conserved residue types (for example, Ala and Gly at the C-terminus of EH35) suggest structural roles for these conserved residues. However, each of these clusters shows a moderately conserved polar residue on the β-strand that might play functional roles. In cluster EH35, this is a conserved Asp at the C-terminal end of the β strand, pointing towards the interior of the (βα)8 barrel. This Asp seems to play important but varied functional roles in different containing proteins. For example, it coordinates a magnesium ion in the protein structure 1F61, while it forms a hydrogen bond with a small molecule ligand in the protein structure 3BOF. In cluster EH79, an intermediately conserved histidine is also located at the C-terminal end of the β strand. When present, it is also observed to interact directly with small molecule or metal ion ligands in containing proteins. In cluster EH15, a moderately conserved histidine is found in the middle of the β strand. Probably because it is deeper inside the barrel than the conserved His in EH79, this histidine is generally not observed to directly interact with ligands in containing proteins. However, its side chain participates in hydrogen bonding networks with polar sidechains from neighboring β strands, sometimes bridged by buried water molecules. Together these groups may form polar bottoms of active site pockets, presumably providing favorable environments for the binding of polar ligands by the respective containing proteins.

3.4. The chimeric barrel A0 is soluble but not in a well-folded state

To test the idea of engineering artificial proteins using the recurring motifs discussed above as interfaces to integrate natural partial domains, a chimeric protein A0 has been designed based on proteins 1YXY and 1THF. The combined new protein A0 is soluble when expressed at 16 ℃ with a C-terminal His-tag and showed one main peak when purified by gel filtration (Fig.6c). The CD spectrum shows that a significant portion of A0 forms secondary structures consistent with α/β proteins (Supplementary Figure 4). The unfolding curves follow a two-state model (Fig.7b-d). These all suggest that the new barrel protein A0 is partially folded. However, A0 is slightly degraded after purification (Fig.6e) and gel filtration with a higher-efficiency column gives two main peaks (Fig.6a). In addition, differential scanning calorimetry (DSC) fails to present a sharp peak (Fig.7a). When A0 is inserted into the β-lactamase tripartite fusion system (Foit and Bardwell, 2013; Foit et al., 2009), the resulting ampicillin resistance is low (Fig.5a). These indicate that A0 is probably not in a well-folded state.

3.5. Directed evolution of A0 led to a well-folded (βα)8-barrel protein

Mutants of A0 were subjected to successive rounds of directed evolution. After selection on plates with increasing concentrations of ampicillin, mutants with potentially improved stability and foldability, as implied by the improved ampicillin resistance, were obtained (Fig.5 and Supplementary Table 5). After six rounds, AR703, one of the mutants that led to the highest ampicillin resistance, was picked for further characterization. The mutated amino acids contained in AR703 are shown on the A0 structure model (Fig.5h). R229C showed very high frequency in the earlier rounds, and it might lead to disulfide bond formation in the relatively oxidative environment of the periplasmic space where the β-lactamase functions. Thus AR703 was characterized in a reductive as well as in an oxidative environment (named AR703_red and AR703_ox, respectively) after purification with the Ni2+ column.

Fig. 5 Directed evolution of A0. (a) The effect of directed evolution was tested by a spot titer experiment on a plate containing 0.4 mg/ml ampicillin. A0 is the original protein, and AB009, AF10, AK102, AP408, AQ803 and AR703 are the best mutants from the first to the sixth rounds of directed evolution. (b-g) The sites mutated in each round are shown as sticks. The 1YXY part is shown in red and the sites of this part mutated in each round are shown in green. The 1THF part is shown in blue and the sites of this part mutated in each round are shown in yellow. The sites mutated in previous rounds are shown in cyan. (h) The AR703 mutant from the final round was picked for further characterization. The mutations in the 1YXY part are I16N, L23P, Y29D, E31D, M36I, G47D, V56D, V216E, G217A, and R229C (shown as green sticks). The mutations in the 1THF part are G74V, G75D, F131Y, I166N, and P185S (shown as yellow sticks).
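The AR703 mutation list in the caption of Fig. 5(h) can be tallied by parent protein to make the uneven distribution between the two partial barrels explicit. The sketch below only re-counts the mutations listed in the caption; the helper `parse_mutation` and the variable names are illustrative.

```python
import re

# AR703 mutations as listed in the caption of Fig. 5(h), grouped by parent part.
MUTATIONS = {
    "1YXY": "I16N L23P Y29D E31D M36I G47D V56D V216E G217A R229C".split(),
    "1THF": "G74V G75D F131Y I166N P185S".split(),
}


def parse_mutation(label):
    """Split a label such as 'R229C' into (wild type, position, mutant)."""
    wild_type, position, mutant = re.fullmatch(
        r"([A-Z])(\d+)([A-Z])", label
    ).groups()
    return wild_type, int(position), mutant


# Number of accumulated mutations in each partial barrel.
counts = {parent: len(labels) for parent, labels in MUTATIONS.items()}
```

Of the 15 listed mutations, ten fall in the 1YXY part and five in the 1THF part.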


Fig.6 The gel filtration, solubility and proteinase resistance results. (a) The gel filtration curve of the AR703 mutant was significantly improved, turning into a single sharp peak in comparison to the original protein A0 (both in the reduced state). (b) The oxidized AR703 shows a more compact state according to its larger elution volume, implying the folding back of the α helix containing the R229C mutation. (c) However, the original A0 shows only one wide main peak during the purification, which may be due to the lower efficiency of the column. (d) Positions of residues R229 and C20 (green sticks) in a structural model generated by simply merging the two half structures from 1YXY (red) and 1THF (blue), respectively. (e) When expressed at 37 ℃, AR703 was totally soluble while the original A0 was not (‘w’ stands for whole cell and ‘s’ stands for supernatant). When treated with proteinase K for 5 min, AR703 was less digested (‘0 min’ stands for no treatment with proteinase K). AR703 was also less spontaneously degraded during expression and purification.

(1) AR703 has been significantly improved into a well-folded protein

This is supported by a range of evidence. Firstly, the solubility and resistance to proteolysis are clearly improved. The original A0 is not soluble when expressed at 37 ℃, but AR703 was soluble under the same condition (Fig.6e). Although A0 is soluble when expressed at 16 ℃, it was insoluble when constructed into a modified pET28a vector with His-tags at both termini to prevent the slight degradation. On the other hand, AR703 is again soluble when expressed from a pET28a vector at 37 ℃ (data not shown). According to the SDS-PAGE result, AR703 is less degraded during expression or purification and is more resistant to protease treatment, indicating it is structurally more stable (Fig.6e).

Secondly, the dispersion state of AR703 in solution is improved relative to A0 according to gel filtration. The gel filtration result for A0 has two peaks (Fig.6a), although it showed only one peak during purification by a Superdex200 16/60 column (Fig.6c), probably due to the higher efficiency of an analytical column. One of the parent proteins of A0, 1YXY, is in a domain-swapped dimeric state. A0 has inherited the partial barrel that participates in this interaction, and similar domain-swapped interactions might account for the apparently aggregated state (Fig.6d), although we could not yet provide experimental evidence for this hypothesis. On the other hand, the evolved AR703 shows only one sharp peak, indicating an improved molecular dispersion state.

Fig.7 The thermal and chemical unfolding curves. (a) The DSC results showed the significant improvement of the AR703 mutant and the further improvement upon formation of the disulfide bond. (b) The DSF results were similar to the DSC results. (c) The mean residue ellipticity (MRE) at 222 nm was collected after equilibration with different concentrations of GuHCl. (d) The unfolding curve was also obtained by collecting tryptophan fluorescence at 334 nm with excitation at 280 nm. The GuHCl equilibrium unfolding also shows the increased stability of the AR703 mutant.

Thirdly, the thermostability of AR703 has been significantly improved. The melting temperatures of A0 and AR703_red determined by DSC are 45.8 ℃ and 58.1 ℃, respectively. AR703_ox is even more thermostable, with a melting temperature of 66.7 ℃ (Fig.7a). The DSF results are consistent with the DSC results, with the melting temperatures detected being 50.5 ℃, 63.7 ℃ and 72.0 ℃ for A0, AR703_red and AR703_ox, respectively (Fig.7b).
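The stabilization can be made explicit by taking differences of the melting temperatures reported above. The short sketch below only performs this arithmetic; the variable and function names are illustrative.

```python
# Melting temperatures in degrees Celsius, as reported in the text above.
TM = {
    "DSC": {"A0": 45.8, "AR703_red": 58.1, "AR703_ox": 66.7},
    "DSF": {"A0": 50.5, "AR703_red": 63.7, "AR703_ox": 72.0},
}


def tm_gains(values):
    """Return (gain from evolution, extra gain on oxidation), rounded to 0.1."""
    evolved = round(values["AR703_red"] - values["A0"], 1)
    oxidized = round(values["AR703_ox"] - values["AR703_red"], 1)
    return evolved, oxidized


gains = {method: tm_gains(values) for method, values in TM.items()}
```

Both methods give a gain of roughly 12 to 13 ℃ for the evolved protein in the reduced state and a further 8 to 9 ℃ under oxidative conditions.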

Finally, the resistance of folded AR703 to denaturants is also increased as compared with A0. According to the unfolding curve detected by circular dichroism, AR703 is much more stable than the original A0, AR703_ox being also more stable than the reduced AR703_red (Fig.7c). The unfolding curve detected by fluorescence gives similar results (Fig.7d).
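Denaturant unfolding curves of this kind are commonly interpreted with a two-state model. The sketch below is an illustration only: it assumes the widely used linear extrapolation form, in which the unfolding free energy decreases linearly with denaturant concentration, and the parameter values in the example are hypothetical rather than fitted to the data in Fig.7. It is not necessarily the analysis applied in this work.

```python
from math import exp

R = 8.314e-3  # gas constant in kJ/(mol*K)


def fraction_unfolded(denaturant, dg_water, m_value, temperature=298.15):
    """Two-state fraction unfolded at a given denaturant concentration (M).

    Assumes dG([D]) = dg_water - m_value * [D], with dg_water in kJ/mol
    and m_value in kJ/(mol*M); both are hypothetical inputs here.
    """
    dg = dg_water - m_value * denaturant
    k_unfold = exp(-dg / (R * temperature))  # equilibrium constant U/N
    return k_unfold / (1.0 + k_unfold)


def midpoint(dg_water, m_value):
    """Denaturant concentration at which half of the protein is unfolded."""
    return dg_water / m_value


# Hypothetical parameters: dG(water) = 20 kJ/mol, m = 10 kJ/(mol*M).
cm = midpoint(20.0, 10.0)
half = fraction_unfolded(cm, 20.0, 10.0)
```

In this picture, a more stable protein has a larger unfolding free energy in water and, for a similar m value, a transition midpoint at higher denaturant concentration.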

Possible causes of the significant differences between AR703_ox and AR703_red are worth further discussion. The only two cysteine residues available to form an intra-peptide chain disulfide bond are R229C and C20. If we built a structural model of the chimera just by putting together the structures of the parent proteins, R229C would be too far from C20 to form a disulfide bond. This is because in the structure of the parent protein 1YXY, the C-terminal helix containing R229 extends away from the main body of the protein and participates in domain-swapped interactions with another peptide chain (see Fig.6d). As AR703 is a monomer, that helix should no longer be interacting with another peptide chain in this evolved chimeric protein. On the other hand, the existence of a disulfide bond under the oxidative condition is strongly supported by the significant differences between AR703_ox and AR703_red. For the disulfide bond to form, R229C must be brought close to C20. This could be made possible if, relative to its position in the parental 1YXY structure, the helix containing R229C folds back towards the main body of the protein. The suggested change would make the protein structure more compact. In support of this, the apparent molecular size of AR703_ox determined by gel filtration was smaller than that of AR703_red (Fig.6b).

These data strongly indicate that AR703, especially its non-reduced form, forms a well-folded globular structure, although we cannot provide an atomic model of it because efforts to obtain crystals of AR703 have been unsuccessful so far.


(2) The partial barrel modules are of different intrinsic stabilities

The mutations encountered in the directed evolution rounds are listed in Supplementary Table 5. While it is not strictly valid to interpret the effects of the individual mutations as always being positive, because neutral or only mildly harmful mutations could have been accumulated during directed evolution, the collection of mutations at each of the six rounds of directed evolution could be considered as adaptively improving. Because of this, and also because relatively few mutations were added at each round, the majority of the finally accumulated mutations should have positive effects.

In terms of sequence positions, the mutations were mainly found to be within the two partial barrels instead of within the super-secondary structure motifs at the interfaces. There is a clear asymmetry in the distribution of mutations inside the two partial barrels, especially in the earlier rounds: selected mutations are significantly concentrated on the partial barrel inherited from 1YXY. This indicates that the starting partial barrel from 1THF is by itself quite stable and difficult to improve, with most random mutations tending to be disruptive. To further look into the folding status of this partial barrel in the starting and the evolved chimeric protein, the fluorescence of the single tryptophan in the chimeric proteins was measured. This tryptophan is located well inside the core of the 1THF partial barrel. The maximum emission wavelengths of A0, AR703_red and AR703_ox are exactly the same (Supplementary Figure 3), suggesting similar environments of the tryptophan (in the 1THF part) in these different proteins or protein forms. This supports that the 1THF partial barrel was well-folded in the designed chimeric protein, even though this part is only a partial, not an intact, structural domain of the parent protein. This observation suggests that engineering new proteins from partial domains is a viable approach.

It is interesting to note that R229C has been selected with very high frequency in the first round of selection (it has been confirmed that none of the nine colonies randomly picked before the selection contains the R229C mutation), and that the final AR703_ox indeed folds much better than AR703_red. A possible explanation is that R229C may lead to an intra-molecular disulfide bond with the only other cysteine (C20), leading to a compact structure that could be less likely to aggregate. This is supported by the smaller molecular size of AR703_ox in comparison with AR703_red observed in gel filtration (Fig.6b). In addition, this disulfide bond contributed significantly to the stability according to the DSC and DSF results (Fig.7a,b). This may imply an important function for newly formed intra-molecular disulfide bonds during protein evolution: to turn a loosely packed structure into a compact one with as few as a single mutation. Subsequent evolution may further optimize the packing.

Although the folded tertiary structure should be much better formed and much more stable in AR703 than in the starting chimeric A0, the secondary structure compositions of both proteins may still be quite similar to those of the parent native proteins. The circular dichroism spectra of A0 and the different forms of AR703 are only slightly different (Supplementary Figure 4). It is not unexpected that A0 has inherited most of the secondary structures of the parent proteins, and that the directed evolution has mainly optimized the packing between secondary structures.

4. Conclusion

Engineering new proteins using partial domains from natural proteins as building blocks is an attractive approach for protein engineering as well as for deciphering protein evolution. Although various bioinformatics approaches have been proposed to segment natural proteins into substructures for various purposes, only a small number of structural or bioinformatics analysis strategies have been proposed to suggest building parts and assembling strategies from a protein engineering perspective (Bauer et al., 2006; Endelman et al., 2004; Pantazes et al., 2007; Silberg et al., 2004; Voigt et al., 2002).

In this work, we used a structure alignment-based sequence clustering approach to systematically analyze the βα and αβ motifs in TIM barrel proteins. A number of recurring sequence-structure motifs have been identified. Besides looking for their implications in sequence-structure-function relationships of TIM barrels, we aimed at exploiting these motifs for protein engineering. Here we explored an engineering strategy of using a recurring motif as a conserved interface to fit two partial barrels together. The initial chimeric protein A0 designed by this strategy is soluble but not well-folded. However, directed evolution using a β-lactamase tripartite fusion system was able to significantly optimize the folding. Eventually, after six rounds of directed evolution, a well-folded mutant AR703 was obtained. Our results suggest that with reasonable optimization efforts using directed evolution, novel well-folded chimeric proteins can be feasibly obtained based on systematic bioinformatics analysis of natural proteins. On the other hand, the initially designed A0 being a relatively poor folder suggests that new analysis methods and design strategies may still need to be explored to increase the design success rate and to reduce the burden on experimental optimization.

Acknowledgements

We would like to thank Dr. James Bardwell for the plasmid TEM1-β-lactamase, Yanwei Ding for help in the DSC experiments, and Zhenhua Shao, Kai Yang, Zexian Liu, Wei Yan, Jian Zhan and Wei Zhao for help and discussions. J.W. thanks Deqing Ma for the encouragement and wishes her a happy wedding. This work has been supported by grants from the National Natural Science Foundation of China (31200546, 31470717 to Q.C. and 31370755 to H.L.).

References

Bauer, D.C., Boden, M., Thier, R., and Gillam, E.M. (2006). STAR: predicting recombination sites from amino acid sequence. BMC Bioinformatics 7.
Blaber, M., and Lee, J. (2012). Designing proteins from simple motifs: opportunities in Top-Down Symmetric Deconstruction. Curr Opin Struct Biol 22, 442-450.
Brändén, C.-I. (1991). The TIM barrel - the most frequently occurring folding motif in proteins. Curr Opin Struct Biol 1, 978-983.
Broom, A., Doxey, A.C., Lobsanov, Y.D., Berthin, L.G., Rose, D.R., Howell, P.L., McConkey, B.J., and Meiering, E.M. (2012). Modular evolution and the origins of symmetry: reconstruction of a three-fold symmetric globular protein. Structure 20, 161-171.
Cadwell, R.C., and Joyce, G.F. (1994). Mutagenic PCR. PCR Methods Appl 3, S136-140.
Cirino, P.C., Mayer, K.M., and Umeno, D. (2003). Generating mutant libraries using error-prone PCR. Methods Mol Biol 231, 3-9.
Dahiyat, B.I., and Mayo, S.L. (1997). De novo protein design: fully automated sequence selection. Science 278, 82-87.
Eisenbeis, S., Proffitt, W., Coles, M., Truffault, V., Shanmugaratnam, S., Meiler, J., and Hocker, B. (2012). Potential of fragment recombination for rational design of proteins. J Am Chem Soc 134, 4019-4022.
Endelman, J.B., Silberg, J.J., Wang, Z.G., and Arnold, F.H. (2004). Site-directed protein recombination as a shortest-path problem. Protein Engineering Design & Selection 17, 589-594.
Farias-Rico, J.A., Schmidt, S., and Hocker, B. (2014). Evolutionary relationship of two ancient protein superfolds. Nat Chem Biol 10, 710-715.
Foit, L., and Bardwell, J.C. (2013). A tripartite fusion system for the selection of protein variants with increased stability in vivo. Methods Mol Biol 978, 1-20.
Foit, L., Morgan, G.J., Kern, M.J., Steimer, L.R., von Hacht, A.A., Titchmarsh, J., Warriner, S.L., Radford, S.E., and Bardwell, J.C. (2009). Optimizing protein stability in vivo. Mol Cell 36, 861-871.
Fortenberry, C., Bowman, E.A., Proffitt, W., Dorr, B., Combs, S., Harp, J., Mizoue, L., and Meiler, J. (2011). Exploring symmetry as an avenue to the computational design of large protein domains. J Am Chem Soc 133, 18026-18029.
Frey, B.J., and Dueck, D. (2007). Clustering by passing messages between data points. Science 315, 972-976.
Heinig, M., and Frishman, D. (2004). STRIDE: a web server for secondary structure assignment from known atomic coordinates of proteins. Nucleic Acids Res 32, W500-502.
Hocker, B., Beismann-Driemeyer, S., Hettwer, S., Lustig, A., and Sterner, R. (2001). Dissection of a (betaalpha)8-barrel enzyme into two folded halves. Nature structural biology 8, 32-36.
Huang, P.S., Feldmeier, K., Parmeggiani, F., Fernandez Velasco, D.A., Hocker, B., and Baker, D. (2016). De novo design of a four-fold symmetric TIM-barrel protein with atomic-level accuracy. Nat Chem Biol 12, 29-34.
Jacobs, T.M., Williams, B., Williams, T., Xu, X., Eletsky, A., Federizon, J.F., Szyperski, T., and Kuhlman, B. (2016). Design of structurally distinct proteins using strategies inspired by evolution. Science 352, 687-690.
Koga, N., Tatsumi-Koga, R., Liu, G., Xiao, R., Acton, T.B., Montelione, G.T., and Baker, D. (2012). Principles for designing ideal protein structures. Nature 491, 222-227.
Kuhlman, B., Dantas, G., Ireton, G.C., Varani, G., Stoddard, B.L., and Baker, D. (2003). Design of a novel globular protein fold with atomic-level accuracy. Science 302, 1364-1368.
Lang, D., Thoma, R., Henn-Sax, M., Sterner, R., and Wilmanns, M. (2000). Structural evidence for evolution of the beta/alpha barrel scaffold by gene duplication and fusion. Science 289, 1546-1550.
Li, Z., Yang, Y., Zhan, J., Dai, L., and Zhou, Y. (2013). Energy functions in de novo protein design: current challenges and future prospects. Annual review of biophysics 42, 315-335.
Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247, 536-540.
Nagano, N., Orengo, C.A., and Thornton, J.M. (2002). One fold with many functions: the evolutionary relationships between TIM barrel families based on their sequences, structures and functions. J Mol Biol 321, 741-765.
Nikkhah, M., Jawad-Alami, Z., Demydchuk, M., Ribbons, D., and Paoli, M. (2006). Engineering of beta-propeller protein scaffolds by multiple gene duplication and fusion of an idealized WD repeat. Biomol Eng 23, 185-194.
Pantazes, R.J., Saraf, M.C., and Maranas, C.D. (2007). Optimal protein library design using recombination or point mutations based on sequence-based scoring functions. Protein Engineering Design & Selection 20, 361-373.
Richter, M., Bosnali, M., Carstensen, L., Seitz, T., Durchschlag, H., Blanquart, S., Merkl, R., and Sterner, R. (2010). Computational and experimental evidence for the evolution of a (beta alpha)8-barrel protein from an ancestral quarter-barrel stabilised by disulfide bonds. J Mol Biol 398, 763-773.
Saab-Rincon, G., Olvera, L., Olvera, M., Rudino-Pinera, E., Benites, E., Soberon, X., and Morett, E. (2012). Evolutionary walk between (beta/alpha)(8) barrels: catalytic migration from triosephosphate isomerase to thiamin phosphate synthase. J Mol Biol 416, 255-270.
Silberg, J.J., Endelman, J.B., and Arnold, F.H. (2004). SCHEMA-guided protein recombination. Protein engineering 388, 35-42.
Soberon, X., Fuentes-Gallego, P., and Saab-Rincon, G. (2004). In vivo fragment complementation of a (beta/alpha)(8) barrel protein: generation of variability by recombination. FEBS Lett 560, 167-172.
Soding, J., and Lupas, A.N. (2003). More than the sum of their parts: on the evolution of proteins from peptides. Bioessays 25, 837-846.
Sullivan, B.J., Durani, V., and Magliery, T.J. (2011). Triosephosphate isomerase by consensus design: dramatic differences in physical properties and activity of related variants. J Mol Biol 413, 195-208.
Sussman, J.L., Lin, D., Jiang, J., Manning, N.O., Prilusky, J., Ritter, O., and Abola, E.E. (1998). Protein Data Bank (PDB): database of three-dimensional structural information of biological macromolecules. Acta crystallographica Section D, Biological crystallography 54, 1078-1084.
Toh, H. (1997). Introduction of a distance cut-off into structural alignment by the double dynamic programming algorithm. Comput Appl Biosci 13, 387-396.
Tomii, K., Sawada, Y., and Honda, S. (2012). Convergent evolution in structural elements of proteins investigated using cross profile analysis. BMC Bioinformatics 13, 11.
Voigt, C.A., Martinez, C., Wang, Z.G., Mayo, S.L., and Arnold, F.H. (2002). Protein building blocks preserved by recombination. Nature structural biology 9, 553-558.
Watanabe, H., Yamasaki, K., and Honda, S. (2014). Tracing primordial protein evolution through structurally guided stepwise segment elongation. J Biol Chem 289, 3394-3404.
Xiong, P., Wang, M., Zhou, X., Zhang, T., Zhang, J., Chen, Q., and Liu, H. (2014). Protein design with a comprehensive statistical energy function and boosted by experimental selection for foldability. Nat Commun 5, 5330.
Yadid, I., and Tawfik, D.S. (2007). Reconstruction of functional beta-propeller lectins via homo-oligomeric assembly of shorter fragments. J Mol Biol 365, 10-17.
Zhang, Y., and Skolnick, J. (2005). TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res 33, 2302-2309.

Conflict of Interest: JA and JAS hold patent (CA2757917A1) rights to some applications of TPP-IOA (which are not supported by these data!).

Highlights

The βα and αβ modules of TIM barrel proteins were clustered based on structure and sequence.

A number of recurring motifs have been identified.

A chimeric protein A0 was created by using the recurring motifs as interfaces.

A0 was significantly improved by six rounds of directed evolution.