Recurring sequence-structure motifs in (βα)8-barrel proteins and experimental optimization of a chimeric protein designed based on such motifs
Jichao Wang, Tongchuan Zhang, Ruicun Liu, Meilin Song, Juncheng Wang, Jiong Hong, Quan Chen, Haiyan Liu
PII: S1570-9639(16)30228-X
DOI: doi:10.1016/j.bbapap.2016.11.001
Reference: BBAPAP 39848
To appear in: BBA - Proteins and Proteomics
Received date: 1 August 2016
Revised date: 4 November 2016
Accepted date: 6 November 2016
Please cite this article as: Jichao Wang, Tongchuan Zhang, Ruicun Liu, Meilin Song, Juncheng Wang, Jiong Hong, Quan Chen, Haiyan Liu, Recurring sequence-structure motifs in (βα)8-barrel proteins and experimental optimization of a chimeric protein designed based on such motifs, BBA - Proteins and Proteomics (2016), doi:10.1016/j.bbapap.2016.11.001
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
ACCEPTED MANUSCRIPT
Recurring sequence-structure motifs in (βα)8-barrel proteins and experimental optimization of a chimeric protein designed based on such motifs
Jichao Wang a, Tongchuan Zhang a, Ruicun Liu a, Meilin Song a, Juncheng Wang a, Jiong Hong a, Quan Chen a,*, Haiyan Liu a,b,c,*
a School of Life Sciences, University of Science and Technology of China, Hefei, Anhui, 230027, China
b Hefei National Laboratory for Physical Sciences at the Microscale, Hefei, Anhui, 230027, China
c Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei, Anhui, 230031, China
Abstract
An interesting way of generating novel artificial proteins is to combine sequence motifs from natural proteins, mimicking the evolutionary path suggested by natural proteins comprising recurring motifs. We analyzed the αβ and βα modules of TIM barrel proteins by structure alignment-based sequence clustering. A number of preferred motifs were identified. A chimeric TIM barrel protein was designed by using recurring elements as mutually compatible interfaces. The foldability of the designed protein was then significantly improved by six rounds of directed evolution, and its melting temperature was increased by more than 20 ℃. A variety of characteristics suggested that the resulting protein is well-folded. Our analysis provides a library of peptide motifs that is potentially useful for different protein engineering studies. The protein engineering strategy of using recurring motifs as interfaces to connect partial natural proteins may be applied to other protein folds.
Keywords: common secondary structure unit; chimeric protein design; directed evolution;
1. Introduction
Strategies to create artificial proteins with desired structures and/or functions are of sustained interest both for the basic understanding of proteins and for protein engineering. De novo protein design holds great promise, and exciting progress has been made (Dahiyat and Mayo, 1997; Huang et al., 2016; Koga et al., 2012; Kuhlman et al., 2003; Xiong et al., 2014). However, the success rate of de novo design is still limited, with few successful examples so far for relatively large proteins (Li et al.,
2013). On the other hand, analysis of the sequence and structure organizations of natural proteins pointed to alternative approaches to engineering artificial
proteins(Blaber and Lee, 2012; Broom et al., 2012; Hocker et al., 2001; Nikkhah et al., 2006; Yadid and Tawfik, 2007). It has been suggested that larger proteins may contain
smaller sequence/structure units that could be considered as core motifs, whose
duplication, recombination or extension may lead to rapid generation of new proteins during evolution (Soding and Lupas, 2003; Tomii et al., 2012). In a number of studies,
well-folded engineered proteins have been created with repetitive motifs (Broom et al., 2012; Nikkhah et al., 2006; Yadid and Tawfik, 2007) or by extending a core motif(Watanabe et al., 2014). More recently, Jacobs et al. proposed a computational
algorithm to construct distinct protein structures by merging motifs of multiple secondary structure units from native proteins (Jacobs et al., 2016). Thus it is highly interesting to investigate recurring sequence-structure motifs in natural proteins, and to explore their applications in protein engineering besides their implications for protein function and evolution.
An interesting target for the systematic analysis of sequence and structural motifs is the (βα)8-barrel protein fold, also called the triosephosphate isomerase (TIM) barrel fold. This fold is the most common one adopted by natural enzymes (Brändén, 1991). Despite their high structural similarity, members of the fold show extensive diversity in sequences and functions (Nagano et al., 2002). As an intact domain, the TIM barrel is formed by eight tandemly arranged (βα) units. Its modular organization at the subdomain level has attracted substantial attention because of its implications for the evolutionary origin of new protein domains (Hocker et al., 2001; Lang et al., 2000; Richter et al., 2010) as well as for protein design (Eisenbeis et al., 2012; Fortenberry et al., 2011; Huang et al., 2016; Soberon et al., 2004). Lang et al.
compared the atomic structures of two TIM-barrel proteins, HisA and HisF, which revealed that two-fold gene duplication and gene fusion from a common half-barrel
ancestor led to the complete barrels of these enzymes (Lang et al., 2000). Hocker et al. went on to demonstrate that the half barrels in HisF form independent stable folding units (Hocker et al., 2001). Later on, Richter et al. suggested that a (βα)2 quarter barrel
unit predecessor might fuse twice to yield the extant (βα)8 barrels based on computational and experimental evidence(Richter et al., 2010). More recently,
Farías-Rico et al. found high sequence and structure similarity of certain (βα)2 segments of TIM-barrel proteins with (βα)2 segments in flavodoxin-like fold
domains(Farias-Rico et al., 2014). They also detected a family of sequences showing
intermediate features between the two folds and determined the structure of one member of this family, which confirmed the cross fold conservation of the structure of
the (βα)2 unit (Farias-Rico et al., 2014).
A variety of protein design and engineering strategies have been reported and tested on TIM-barrel proteins, such as sequence consensus-based design (Sullivan et
al., 2011) or catalytic migration (Saab-Rincon et al., 2012). Of particular relevance here are those engineering studies based on protein fragments, which could be viewed as inspired by the modular organization of natural TIM barrels. Notably, Eisenbeis et al. reconstituted an intact, well-folded artificial eight-unit barrel from fragments with different folds, one fragment from the TIM-barrel protein HisF and the other from a flavodoxin-like fold protein CheY (Eisenbeis et al., 2012). Fortenberry et al. reported the successful engineering of a perfectly symmetric variant of HisF (Fortenberry et al., 2011).
Insights from these previous studies suggest that it is worthwhile to systematically identify and compile recurring sequence and structure elements in TIM barrel proteins. In addition, it would also be interesting to explore new ideas that exploit these elements for protein engineering. In the current study, we applied a structure alignment-based sequence clustering approach to analyze such elements systematically. Most previous systematic analyses of TIM barrels have focused on comparisons between full-length sequences and structures. Some interesting local
motifs can be identified from this approach, such as the phosphate binding motif emphasized by Nagano et al.(Nagano et al., 2002). However, we expect that a
systematic fragment-centric analysis may provide a more complete picture of the types and distributions of local motifs than the full-length-centric studies. We retrieved the αβ and βα fragments from a set of non-redundant TIM-barrel proteins as
the basic fragments to be clustered. This fragmentation scheme allows elements of variable lengths to be extracted, differing from approaches considering contiguous
fragments of fixed lengths. In addition, several adjustments to the usual dynamic-programming-based structure alignment approach were applied to make sure
that the alignments between fragments are always compatible with the alignments
between complete barrels. After identifying the most frequent αβ and βα elements, longer elements were identified through their extensions.
Finally, we tested an engineering strategy of using recurring elements as conserved interfaces or recombination sites between partial domains of natural TIM barrels. This strategy is different from that adopted in previous engineering efforts in
which recurring elements were used mainly as components for recombination (Broom et al., 2012; Eisenbeis et al., 2012; Nikkhah et al., 2006; Yadid and Tawfik, 2007). Although an initial chimeric protein constructed with this strategy does not show ideal properties as a stable, well-folded monomeric globular domain, its folding properties can be greatly improved by an in vivo directed evolution approach optimizing protein stability and foldability (Foit and Bardwell, 2013; Foit et al., 2009). After six rounds of evolution, the foldability of the engineered protein is significantly improved according to various lines of experimental evidence; in particular, the melting temperature is increased by more than 20 ℃.
2. Materials and Methods
2.1 Computational analysis of recurring motifs
(1) Defining basic fragments
Proteins of the (βα)8-barrel fold in the Structural Classification of Proteins (SCOP)
database(Murzin et al., 1995) were collected and only one family member was selected for each protein family. The dataset was further purged to eliminate proteins of above 25% pair-wise sequence identity. This led to a final set containing 108
proteins. The 25% sequence identity requirement served to avoid possible residual redundancies in the dataset and affected the actual dataset size only minimally. The
Protein Data Bank (PDB) (Sussman et al., 1998) IDs are given in Supplementary Table 1. Secondary structures in these proteins were initially assigned automatically
with the STRIDE program(Heinig and Frishman, 2004). To avoid inaccurate assignments caused by structural variations, especially to obtain a set of assignments
in which the boundaries of the strands forming the barrels were consistently defined
in all proteins, a number of the automatically assigned boundaries of the secondary structure elements were manually adjusted, so that the starting residues of the strands fall approximately into the same layer around the barrel. In addition, for the
first residue of every strand, its side chain should point toward the interior of the barrel, and its backbone carbonyl should form a hydrogen bond to a backbone amide of the previous strand. Based on the secondary structure assignments, two sets of super-secondary structure fragments were extracted from the intact proteins. The first set comprised all the αβ fragments (referred to as HE fragments). The second comprised the βα fragments (referred to as EH fragments). Each fragment included two sequentially adjacent segments in regular secondary structures (helix or strand) and the loop connecting them. In what follows, a fragment will be identified by its type (“HE” or “EH”), its source PDB ID and its sequential number (ranging from 1 to 8) in the barrel. We note that all non-terminal secondary structure segments are included twice in the fragment sets, once in the HE set and once in the EH set. In total 769 αβ fragments and 769 βα fragments were obtained.
(2) Discovering recurring motifs
Structure alignment. By usual standards, the fragments within each of the sets are
structurally highly similar to each other. Here we examined whether they can be further
distinguished into subsets according to their structural similarities. For this purpose, we carried out pair-wise structural alignments with the double dynamic programming (Toh, 1997) algorithm. The following adjustments were made to the algorithm, so that
the resulting alignments based on the local fragments are compatible with alignments based on the global (βα)8 barrel. Firstly, the same number of (best-aligned) residues
from the helix and from the strand segments are used for structure superposition during double dynamic programming. This is to prevent the superposition from being dominated by the helix segments, which contain many more residues than the strand segments. Secondly, all
main chain heavy atoms (N, Cα, C and O) are considered in the inter-position displacement calculations to make sure that the aligned residues have similar
directions for hydrogen bond formation. The structural similarity between two fragments i and j was scored by the fraction of aligned positions relative to the fragment lengths (equation 1). Here $N_{ij}$ is the number of aligned residues, i.e., residues whose main chain heavy atoms show a root mean square deviation below 2.5 Å after structural superposition. The lengths of the fragments are denoted $L_i$ and $L_j$, respectively. This score has a value between 0 and 1.
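As a rough illustration of how the number of aligned residues could be counted, the sketch below assumes that the two fragments are already superposed and that each aligned position carries coordinates for the four main chain heavy atoms (N, Cα, C, O). The function name, the data layout and the toy coordinates are illustrative assumptions and not the authors' implementation; only the 2.5 Å threshold is taken from the text. The normalization by fragment lengths is deliberately left out.

```python
import math

CUTOFF = 2.5  # Å, threshold quoted in the text


def count_aligned(res_pairs, cutoff=CUTOFF):
    """Count aligned positions whose main chain RMSD is below `cutoff`.

    res_pairs: list of (atoms_i, atoms_j); each atoms_* is a list of four
    (x, y, z) tuples for N, CA, C, O after superposition (assumed layout).
    """
    n_aligned = 0
    for atoms_i, atoms_j in res_pairs:
        # Squared displacement of each of the four backbone atoms
        sq = [sum((a - b) ** 2 for a, b in zip(p, q))
              for p, q in zip(atoms_i, atoms_j)]
        rmsd = math.sqrt(sum(sq) / len(sq))
        if rmsd < cutoff:
            n_aligned += 1
    return n_aligned


def shift_x(atoms, d):
    """Translate a residue along x by d Å (toy helper)."""
    return [(x + d, y, z) for x, y, z in atoms]


# Toy data: one residue pair displaced by 1 Å, another by 4 Å
base = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.5, 1.0, 0.0), (2.5, 2.2, 0.0)]
pairs = [(base, shift_x(base, 1.0)), (base, shift_x(base, 4.0))]
n_close = count_aligned(pairs)
```

Using all four backbone atoms rather than Cα alone makes the count sensitive to the orientation of the peptide planes, which is the stated purpose of the second adjustment.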
Sequence similarity. Based on their structure alignment, the sequence similarity between two fragments i and j is defined as the sum of the BLOSUM62 residue similarity scores (B62) over the aligned positions, i.e.,
$$S_{ij} = \sum_{l} B_{62}\left(a_l^{i}, a_l^{j}\right) \qquad (2)$$
in which $a_l^{i}$ and $a_l^{j}$ represent the residue types at aligned position l in fragments i and j, respectively. On average, this similarity score increases with the alignment length. It thus reflects both structure similarity and sequence similarity. The score in equation (2) is scaled to give a nominal Z-score in equation (3),
$$Z_{ij} = \frac{S_{ij} - \langle S \rangle}{\sigma_S} \qquad (3)$$
in which $\langle S \rangle$ and $\sigma_S$ are the average and standard deviation of all possible pairwise $S_{ij}$ values.
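The two scores above can be sketched in a few lines. The substitution table below is a toy excerpt covering three residue types only (the values are meant to mirror BLOSUM62 but should be treated as illustrative), the fragment names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev

# Toy substitution table for G, A and D only (illustrative excerpt)
TOY_B62 = {("G", "G"): 6, ("A", "A"): 4, ("D", "D"): 6,
           ("G", "A"): 0, ("G", "D"): -1, ("A", "D"): -2}


def b62(x, y):
    """Symmetric lookup in the toy table."""
    return TOY_B62.get((x, y), TOY_B62.get((y, x)))


def seq_similarity(aligned_pairs):
    """Equation (2): sum of substitution scores over aligned positions."""
    return sum(b62(x, y) for x, y in aligned_pairs)


def z_scores(scores):
    """Equation (3): centre and scale by the statistics of all pairwise scores."""
    mu = mean(scores.values())
    sigma = pstdev(scores.values())  # population estimator: an assumption
    return {pair: (s - mu) / sigma for pair, s in scores.items()}


# Structure-derived residue pairings for three hypothetical fragments
alignments = {
    ("f1", "f2"): [("G", "G"), ("A", "A"), ("D", "D")],
    ("f1", "f3"): [("G", "A"), ("A", "A")],
    ("f2", "f3"): [("G", "D"), ("A", "D")],
}
raw = {pair: seq_similarity(al) for pair, al in alignments.items()}
z = z_scores(raw)
```

Because the raw score is a sum over aligned positions, the longer alignment of f1 and f2 is rewarded twice: once for matching residues and once for having more positions to sum over.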
The scores in (1) and (3) form continuum spectra. They were further regularized to make it easier to identify recurring motifs using modern clustering algorithms such as Affinity Propagation(Frey and Dueck, 2007). The scheme of similarity
regularization has been determined by trial and error, with many small variations on the scheme giving similar but not always exactly the same clustering results. The final
scheme was chosen with a balanced consideration of cluster sizes, stability of clusters with respect to parameter perturbations, and eventually, manual inspection of the major clusters for structure/sequence consistency and for functional relevance (see
results below). First, the structural similarity scores obtained by equation (1), denoted here $S^{\mathrm{str}}_{ij}$, are regularized to values of 0 and 1 based on cutoffs, namely,
$$\tilde{S}^{\mathrm{str}}_{ij} = \begin{cases} 1, & S^{\mathrm{str}}_{ij} \ge S_{\mathrm{cut}} \\ 0, & \text{otherwise} \end{cases} \qquad (4)$$
The cutoff scores $S_{\mathrm{cut}}$ have been chosen so that a quarter of the pairwise scores are transformed into 1. For the EH fragment set this cutoff is 0.54, while for the HE fragment set the cutoff is 0.67. At these cutoffs, the minimum numbers of aligned positions are 10 for the EH set and 8 for the HE set (Supplementary Figure 1). Second, the regularized $\tilde{S}^{\mathrm{str}}_{ij}$ was applied to mask the Z-score $Z_{ij}$ in equation (3) to give the final similarity score for clustering, namely,
$$C_{ij} = \tilde{S}^{\mathrm{str}}_{ij} \, Z_{ij} \qquad (5)$$
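A minimal sketch of the regularization and masking steps follows. It assumes that the cutoff is taken as the score at the upper quartile of all pairwise structural scores, that scores equal to the cutoff map to 1, and that masked pairs receive a similarity of 0; these details, like the function names and the toy numbers, are assumptions rather than the published procedure.

```python
def quartile_cutoff(values, top_fraction=0.25):
    """Score such that roughly `top_fraction` of the values are >= it."""
    ranked = sorted(values, reverse=True)
    k = max(1, round(top_fraction * len(ranked)))
    return ranked[k - 1]


def mask_scores(struct, z, cutoff):
    """Binarize structural scores at `cutoff`, then mask the Z-scores.

    Pairs below the cutoff get 0.0 (assumed reading of "mask").
    """
    return {pair: (z[pair] if struct[pair] >= cutoff else 0.0)
            for pair in struct}


# Toy structural scores for eight hypothetical fragment pairs
struct = {("a", "b"): 0.80, ("a", "c"): 0.40, ("a", "d"): 0.55,
          ("b", "c"): 0.30, ("b", "d"): 0.60, ("c", "d"): 0.20,
          ("a", "e"): 0.10, ("b", "e"): 0.50}
# Toy Z-scores, made up for the example
z = {pair: 2.0 * s - 1.0 for pair, s in struct.items()}

cut = quartile_cutoff(struct.values())
masked = mask_scores(struct, z, cut)
```

The resulting dictionary plays the role of the similarity matrix C: only structurally similar pairs keep a sequence-based score, so the clustering cannot group fragments on sequence resemblance alone.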
Clustering. Affinity Propagation clustering was applied separately to the HE and the EH sets with the above similarity matrix C as input. Relatively large clusters were inspected for pairwise similarities within clusters. It was found that a cluster would occasionally contain a member that has a high similarity score only to the representative member of the cluster assigned by the algorithm, but not to other members of the cluster, while the remaining members showed high mutual similarities. More detailed inspection indicated this was mainly caused by accidental sequence similarity (identity) of the concerned fragment with the cluster representative at positions not
conserved in the cluster as a whole. As the similarity scores have been evaluated with a small number of aligned positions, such accidental sequence identity may lead to
high scores between unrelated fragments. To eliminate the effects of such misclassification, we removed cluster members which were close neighbors of less than half of the other members of the cluster. Here by close neighbors we mean
fragment pairs with a masked Z score (equation (5)) of above 1.0. Then a sequence profile was generated for the aligned positions of each cluster, with all cluster
members aligned to the cluster representative. From the profile, highly conserved sequence positions at which more than half of the cluster members have the same
residue types were identified. For clusters with 4 or more highly conserved positions,
member fragments having the respective conserved residue types at less than 2 conserved positions were further removed from the cluster.
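The post-processing rules described in this paragraph can be sketched as three small functions. The thresholds (masked Z-score above 1.0, more than half of the members, at least 4 conserved positions, at least 2 matches) come from the text; the single-pass pruning, the gap character "-", the function names and the toy data are assumptions made for illustration.

```python
from collections import Counter


def prune_cluster(members, score, threshold=1.0):
    """Keep members that are close neighbors of at least half of the others."""
    kept = []
    for m in members:
        others = [o for o in members if o != m]
        close = sum(1 for o in others if score(m, o) > threshold)
        if 2 * close >= len(others):
            kept.append(m)
    return kept


def conserved_positions(aligned_seqs):
    """Map column index to residue where more than half of the members agree."""
    conserved = {}
    for col, column in enumerate(zip(*aligned_seqs)):
        residue, count = Counter(column).most_common(1)[0]
        if residue != "-" and 2 * count > len(column):
            conserved[col] = residue
    return conserved


def drop_weak_members(aligned_seqs, conserved, min_sites=4, min_matches=2):
    """For clusters with >= min_sites conserved positions, drop members that
    carry the conserved residue at fewer than min_matches of them."""
    if len(conserved) < min_sites:
        return list(aligned_seqs)
    return [s for s in aligned_seqs
            if sum(s[c] == r for c, r in conserved.items()) >= min_matches]


# Toy masked Z-scores between four hypothetical cluster members
toy_scores = {frozenset(p): v for p, v in {
    ("a", "b"): 2.0, ("a", "c"): 1.5, ("b", "c"): 1.8,
    ("a", "d"): 1.2, ("b", "d"): 0.3, ("c", "d"): 0.1}.items()}


def score(x, y):
    return toy_scores[frozenset((x, y))]


kept = prune_cluster(["a", "b", "c", "d"], score)
profile = conserved_positions(["GADGL", "GADSV", "GAEGI", "GSDKA"])
```

In the toy data, member "d" is close only to "a" (which stands in for the cluster representative) and is therefore removed, which is the situation the pruning rule is meant to catch.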
With the above protocol, we expected to retain only those basic fragments that are most certainly recurring. Longer recurring fragments containing more than one basic fragment were then found by looking at recurring patterns of consecutive fragment
types in the 108 proteins.
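The search for longer recurring fragments can be pictured as counting consecutive cluster labels along each barrel. In the sketch below, each protein is represented by the ordered list of cluster labels of its basic fragments, with None for unclustered fragments. The protein names and label lists are hypothetical, and the requirement of at least two proteins is taken from the criterion stated in the Results.

```python
from collections import defaultdict


def extended_motifs(proteins, min_proteins=2):
    """Find consecutive cluster-label pairs present in >= min_proteins proteins.

    proteins: {protein_id: [cluster label or None, ...]} in sequence order.
    """
    seen = defaultdict(set)
    for pid, labels in proteins.items():
        for first, second in zip(labels, labels[1:]):
            if first is not None and second is not None:
                seen[(first, second)].add(pid)
    return {combo: sorted(pids) for combo, pids in seen.items()
            if len(pids) >= min_proteins}


# Hypothetical label strings for three proteins
toy = {
    "protA": ["EH10", "HE66", "EH64", None, "HE12"],
    "protB": [None, "HE66", "EH64", "HE12"],
    "protC": ["EH10", "HE12", None, "EH64"],
}
recurrent = extended_motifs(toy)
```

Here only the HE66-EH64 combination recurs, since every other consecutive pair is seen in a single protein.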
2.2. Experimental studies on a designed chimeric protein
(1) Design of the chimeric protein
We experimented with the idea of engineering artificial proteins using the recurring motifs not as building blocks, but as interfaces, to integrate natural partial domains. Two proteins (PDB: 1YXY and 1THF, respectively) were selected as donors of partial domains. 1YXY is a TIM-fold protein whose structure was deposited in the PDB without any associated publication so far, and its structural stability has not been reported. 1THF is another TIM-fold protein, from the hyperthermophilic species Thermotoga maritima. Thus it is expected to have high structural stability. In fact, several successful protein recombination studies have been reported using this protein as a fragment donor (Akanuma and Yamagishi, 2008; Bharat et al., 2008; Eisenbeis et al., 2012; Hocker, 2014; Hocker et al., 2004; Shanmugaratnam et al., 2012). According to the computational analysis, the 3rd and the 7th (αβα) units of these two proteins fall into
the same or closely related structure-sequence motifs. Figure 1 illustrates the design of the chimeric protein CBA0 (chimeric barrel A0, or simply A0). Based on the
locations of the basic fragments of shared (or highly similar) types in these two proteins, the sequence of A0 was designed to be composed of three segments, the first segment from the N terminus to the middle of 3 taken from 1YXY, the second segment from
the middle of 3 to the beginning of 7 taken from 1THF, and the last segment from 7 to the C terminus again taken from 1YXY. A few point mutations from the parent
sequences were introduced. L162A and K196A serve to avoid clashes. Moreover, the two parental TIM barrels have somewhat different hydrophobicity at the core of the
respective barrels. In 1YXY, the core of the barrel is exclusively hydrophobic, while
in 1THF, the bottom of the barrel (the N-terminal ends of the strands) is sealed with a ring of polar residues, with the rest of the interior of the barrel being mostly
hydrophobic. Thus further point mutations, I52V, K92V and E160F were introduced, all within the 1THF half and corresponding to residues forming the bottom ring of the barrel. These mutations replaced the residue types from those of 1THF to those in
1YXY. As a result, the chimeric barrel has a purely hydrophobic interior. The gene encoding A0 was synthesized and the protein was prepared by recombinant expression (see experimental details) for further characterization.
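The construction logic (keep the ends of one parent, swap in the middle segment of the other, then apply point mutations) can be sketched as follows. The two short parent strings are hypothetical stand-ins and not the 1YXY or 1THF sequences. The sketch also assumes that both parents share one residue numbering, whereas in the actual design the crossover points follow from the structural alignment of the shared motifs.

```python
import re


def build_chimera(parent_a, parent_b, start, end):
    """Replace parent_a[start:end] (0-based, end exclusive) by the same span of parent_b."""
    return parent_a[:start] + parent_b[start:end] + parent_a[end:]


def apply_mutations(seq, mutations):
    """Apply point mutations written as <wild type><1-based position><new residue>."""
    residues = list(seq)
    for mut in mutations:
        wt, pos, new = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mut).groups()
        idx = int(pos) - 1
        # Guard against numbering mistakes before substituting
        if residues[idx] != wt:
            raise ValueError(f"{mut}: expected {wt} at {pos}, found {residues[idx]}")
        residues[idx] = new
    return "".join(residues)


parent_a = "MKLAVGGSTA"  # hypothetical donor of the N- and C-terminal segments
parent_b = "MRIEVKDSQA"  # hypothetical donor of the middle segment
chimera = build_chimera(parent_a, parent_b, 3, 7)
variant = apply_mutations(chimera, ["K6V"])
```

Checking the wild-type residue before each substitution is a cheap safeguard when mutation labels are quoted in more than one numbering scheme.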
Fig.1 Design of the chimeric protein A0 model. The protein 1THF is shown in blue and 1YXY in red. The (αβα)3 and (αβα)7 modules are shown in cartoon representation and aligned by TM-align (Zhang and Skolnick, 2005). The RMSD between the two proteins’ (αβα)3 modules is 1.64 Å (left), and that between the (αβα)7 modules is 1.39 Å (middle). A new (βα)8-barrel protein model was obtained by combining half barrels cut from the similar (αβα) modules, replacing the (βα)3_7 part of 1YXY by the (βα)3_7 part of 1THF (right). The residues shown as green sticks were mutated: L162A and K196A to avoid clashes, and I59V, K92V and E160F for the hydrophobic unification of the new protein.
(2) Directed evolution of the chimeric protein
Variants of A0 were obtained through directed evolution aimed at improving protein structural stability. The in vivo directed evolution system of Bardwell and coworkers (Foit and Bardwell, 2013; Foit et al., 2009) was used. In this system, a
protein of interest (POI, here A0 or a random mutant of A0) is inserted into the TEM β-lactamase at a specific position to form a fusion protein, which is expressed in host bacterial cells. As an unstable POI would result in increased proteolysis of the fusion protein, which would in turn lead to reduced or defective cellular β-lactamase activity, the stability of the POI is linked to the ampicillin resistance of the host cells. This system was employed to improve the foldability of A0 through directed evolution. In each round of directed evolution, a library was generated with error-prone PCR (Cadwell and Joyce, 1994; Cirino et al., 2003) from one or a few templates. The cells containing the library were selected for clones that survived an ampicillin concentration that prohibits the growth of bacteria expressing the respective templates. Templates for each round were proteins selected from the previous round. The process was ended when no further increase in antibiotic resistance could be achieved in several repeated attempts.
(3) Characterization of the chimeric protein and its variants
The chimeric protein and several variants were expressed, purified and
characterized for a number of properties. General structural integrity was checked by proteinase K resistance. Thermal stability was measured with differential scanning
microcalorimetry (DSC) as well as differential scanning fluorimetry (DSF). Chemical-induced denaturation by guanidine hydrochloride was monitored by circular dichroism and fluorescence spectroscopy.
(4) Experimental details
Directed evolution. The templates and the mutant libraries were constructed into
LFM10 vector (Foit and Bardwell, 2013; Foit et al., 2009) by ligating PCR products at BamHI and XhoI sites. The templates for error-prone PCR were first amplified by PCR using PrimeSTAR (TaKaRa). For the first and second rounds of directed evolution, the primers for both steps were caggatccatgaaaccgacgaaagaaaaactg and aactcgagcctttcagagcttcaataaag. In the third to the sixth rounds, the primers for error-prone PCR were the adaptor primer sequences tgccacctgacgtctaagaa and attaccgcctttgagtgagc. To introduce the needed adaptor sequences, amplification by PrimeSTAR with the extended primers tgccacctgacgtctaagaaggatccatgaaaccgacgaaagaaaaactg and attaccgcctttgagtgagcctcgagcctttcagagcttcaataaag was used. The ligation reaction product was transformed into NEB 10-beta cells (New England Biolabs). The increased antibiotic resistance of selected colonies was confirmed by reconstruction into the original LFM10 vector, retransformation into new NEB 10-beta cells, and an antibiotic resistance assay with serial dilution (to exclude effects of mutations in the plasmid backbone or the bacterial genome).
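The relationship between the primer sets can be checked with a short script. It uses the primer sequences quoted above together with the standard recognition sequences of BamHI (GGATCC) and XhoI (CTCGAG). The helper name is illustrative, and the observation that each extended primer equals an adaptor followed by the original primer from its restriction site onward is a reading of the sequences, not a statement made in the protocol.

```python
BAMHI, XHOI = "ggatcc", "ctcgag"  # restriction sites, lower case to match the primers

forward = "caggatccatgaaaccgacgaaagaaaaactg"
reverse = "aactcgagcctttcagagcttcaataaag"
adaptor_fwd = "tgccacctgacgtctaagaa"
adaptor_rev = "attaccgcctttgagtgagc"
extended_fwd = "tgccacctgacgtctaagaaggatccatgaaaccgacgaaagaaaaactg"
extended_rev = "attaccgcctttgagtgagcctcgagcctttcagagcttcaataaag"


def tail_from_site(primer, site):
    """Return the primer from the first occurrence of `site` onward."""
    return primer[primer.index(site):]


# Each extended primer should be adaptor + (original primer from the site onward)
fwd_ok = extended_fwd == adaptor_fwd + tail_from_site(forward, BAMHI)
rev_ok = extended_rev == adaptor_rev + tail_from_site(reverse, XHOI)
```

A check of this kind is a quick way to catch transcription errors in long primer strings before ordering or reusing them.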
Protein expression and purification. The DNA encoding the initial designed protein A0 and that encoding its variant AR703 from the final round of directed evolution were inserted into a modified pET-22b(+) vector separately. They were expressed as His-tag fusion proteins in Escherichia coli BL21(DE3) by induction with 1 mM IPTG for 22 h at 16 ℃ (A0) or for 4 h at 37 ℃ (AR703). The proteins were purified by Ni2+ affinity column chromatography. After purification, dithiothreitol (DTT, final concentration
10 mM) or sufficient amount of H2O2 (about 20 times in mol ratio to the amount of protein) was added and incubated on ice for one hour to make the cysteine residues
reduced or oxidized. Gel-filtration chromatography with a Superdex 200 16/60L column or a Superdex 200 10/300 GL column (GE Healthcare) was used for further purification or analysis. The proteins for further physicochemical characterization
were kept in buffer containing 20 mM Tris-HCl, 300 mM NaCl and 1 mM EDTA (pH 8.0) unless otherwise indicated. When necessary, DTT was added (final
concentrations 2-6 mM) in the buffer to keep the protein in reduced state for subsequent studies. The purified proteins were confirmed by SDS-PAGE and the
concentrations were determined by the absorbance at 280 nm.
Protease resistance assay. Protein sample of 1 mg/ml was incubated with
1/200000(V/V) proteinase K (TaKaRa, Code No. 9034) for 5 min at 37 ℃. The same reaction without proteinase K was carried out as reference. Digestion was terminated by immediately adding SDS-loading buffer and heating the sample at 100 ℃ for 10
min. SDS-PAGE was carried out to check the resistance to proteinase K.
Thermostability measurements. DSC experiments were performed on a VP-DSC Microcalorimeter (MicroCal Inc.). Protein at 3 mg/ml was subjected to a heating and cooling ramp from 10 ℃ to 85 ℃ at a scanning rate of 1 ℃/min. Origin lab software (MicroCal Inc.) was used to analyze the raw data to obtain a heat capacity profile. DSF measurements were performed on a Roche LightCycler480 real-time PCR system with excitation at 455 nm and emission at 580 nm. DSF reaction mixtures composed of 16 μM protein and 5× SYPRO Orange dye (Sigma-Aldrich) were heated from 25 ℃ to 80 ℃ at a rate of 2.4 ℃/min. The melting temperature was calculated with the built-in software.
Chemical-induced denaturation. The circular dichroism spectra were obtained on a Jasco J-810 spectropolarimeter using a 1 mm path length quartz cuvette at room temperature. Measurements were carried out using 0.2 mg/ml protein samples in 20
mM phosphate buffer (pH 7.8) and each spectrum was recorded in the wavelength
range of 260-200 nm. The data at λ=222 nm were followed. The intrinsic fluorescence
emission spectra were measured on a Shimadzu RF-5301PC spectrofluorophotometer, with an excitation wavelength of 280 nm over a wavelength range of 300-400 nm. An increasing concentration of guanidine hydrochloride was added to 0.4 mg/mg protein
NU
solution and equilibrated overnight at 4 ℃. The fluorescence at 334 nm was
collected.
3. Results and Discussion
3.1. The existence of recurring motifs in (βα)8 barrels.
Clustering of the basic fragments led to 82 clusters that covered 397 HE
fragments and 80 clusters that covered 379 EH fragments with each cluster containing at least three members. For the HE fragments and EH fragments, Supplementary Tables 2 and 3 separately list the details for each cluster (referred to by a unique
numeric ID): size (i.e., number of basic fragments forming the cluster), the averaged similarity score between cluster members, the number of conserved sequence positions across cluster members, as well as a list of all member fragments. A number of properties associated with the clustering results strongly support the existence of recurring HE and EH motifs in (βα)8 barrel proteins.
Fig.2 (a) Cluster size distributions. Green: HE fragments, Yellow: EH fragments. (b) Distributions of the number of conserved positions in individual clusters. Green: HE fragments, Yellow: EH fragments. (c) The intra-cluster and inter-cluster sequence Z-score distributions. Yellow: Inter cluster sequence Z-score distribution of fragments in different clusters. Dark Green: Intra cluster sequence z-score distribution of fragments in the same clusters. Light Green: when two fragments are in the same cluster, other fragments in the two proteins are compared to determine the Z-score. (See the main text for details.)
First, the larger the clusters, the less likely that the intra-cluster similarities have appeared by chance. For both the HE and the EH types of fragments, more than 50%
of the clusters comprised five or more members (Fig.2a).
Second, more than 90% (80%) of the clusters contain at least one (two) conserved sequence positions (Fig.2b). The distributions in Figure 2b appear to be
multi-modal, suggesting that while there is a group of clusters that have few (fewer than five, mostly two or three) conserved positions, there is also a substantial number of clusters that have five or more conserved positions. We note that these
conserved positions have been derived based on the alignments of structures, not sequences. Thus the probability of observing conserved residue types at aligned positions by chance should be small.
Third, for most clusters, the mutually similar member fragments in one cluster are mostly from (βα)8 barrel proteins whose remaining parts are not more similar than fragments from proteins that do not contain any similar fragments. To demonstrate this, for every pair of (βα)8 barrel proteins that have at least one basic fragment assigned to the same cluster, we analyzed the similarity between the remaining fragments of the two proteins. In this analysis, only the fragments separated by the same number of (βα)-units from the shared fragments were paired for comparison. The resulting distribution of pairwise similarity scores is almost the same as that of the scores of unrelated fragment pairs, which is significantly shifted towards smaller scores as compared with the distribution of intra-cluster similarity scores (Fig.2c). The distributions of intra-cluster and inter-cluster similarity Z-scores are not
strictly separable (Fig.2c). Given that the clustering scheme is not perfect, there is the possibility that similar fragments in separate clusters are actually related. Thus in
footnotes of Supplementary Tables 2 and 3, we list pairs of clusters that are associated with averaged inter-cluster similarity Z scores of above 1.5. For the HE clusters, there are 18 such pairs, most of them associated with a network of clusters that are
relatively densely connected with medium mutual similarity, the network involving clusters 9, 11, 36, 38, 58, 63, and 102. For the EH clusters, only two such pairs were
found.
To find recurring motifs containing multiple basic fragments, we analyzed clusters
of basic HE and EH motifs that co-occurred at consecutive positions in different (βα)8
barrels. Cluster combinations that co-occurred in at least two proteins have been listed, together with the locations of the respective basic fragments in containing proteins
(Supplementary Table 4). Compared with the number of clustered basic fragments, there are relatively few recurrent extended motifs of multiple basic fragments. This indicates that combinations between the basic αβ and βα units can vary a lot, and that
there may be only a few specific combinations of basic fragments which are particularly favored. One of the most frequent extended motifs is the motif composed of HE66-EH64; the loop in this motif corresponds to the well-known phosphate binding motif in (βα)8 barrels (see below). Besides this motif, many of the recurrent extended motifs (Supplementary Table 4) contained the basic motifs that are associated with the coordination of metal ions, including motifs EH54, EH97 and EH72 (see below). This is probably because metal coordination requires cooperative interactions involving residues on multiple αβ or βα units.
3.2. Conserved residues in recurring HE motifs mainly play structural roles.
The conserved positions and respective conserved residue types within the clusters were inspected in detail. For the clusters that show strong sequence conservation,
important sequence-structure or sequence-function relationships could be revealed.
TE
D
Fig.3 Examples of HE clusters. The sequence logos are shown left and their superimposed structures are shown right. The cluster members can be found in Supplementary Table 2. The conserved sites discussed in the main text are shown in red with their sidechains displayed.
As examples, the sequence logos and superimposed structures of the four HE clusters that have the highest intra-cluster sequence similarity Z-scores are given in Fig.3. Each of these clusters contains at least five members. For three of them, namely clusters HE102, HE36 and HE63, the conserved positions are concentrated in the loop region. All three clusters have a conserved glycine at the beginning of the loop, and exhibit a two-residue “GA” pattern comprising this glycine and an alanine next to it. Because of its small side chain, the glycine may signal the end of the helix, promoting a sharp turn of the peptide backbone at this position. Alanine, the most common residue type at the next position, also has a small side chain. In clusters HE102 and HE63, the position right after the “GA” pattern is also highly conserved as an aspartate. As the side chain of this aspartate invariably forms a hydrogen bond with the N-terminal backbone of a neighboring β-strand, it may be important for the stability of the overall (βα)8 barrel, as suggested and experimentally tested previously (Nagano et al., 2002). In cluster HE102, the “GAD” sequence pattern is extended by another conserved glycine. Cluster HE36 differs from clusters HE102 and HE63 in that it presents only the “GA” but not the “GAD” or “GADG” sequence patterns in the loop. However, at the position one residue away from the “GA” pattern, cluster HE36 shows strong preferences for residues with large aromatic side chains.
The other example, HE74, exhibits a different pattern of sequence conservation (Fig.3). The peptide leaves the loop and enters the strand with a three-residue “IPV” sequence motif. In addition, the large hydrophobic side chains of the conserved isoleucine and valine in this motif form a buried hydrophobic cluster with conserved leucine/valine sidechains from the N-terminal end of the helix. We note that HE74 has a relatively wide spread of structures among its members.
The above examples suggest that the conserved residues in recurring HE motifs mainly play structural roles. This is consistent with the observation that inside the (βα)8 barrels, the active sites are usually located near the C-terminus of the inner β-strand barrel, that is, close to the βα loops but not the αβ loops.
3.3. Conserved residues in recurring EH motifs may play important functional roles.
There are 45 EH clusters in total that each have at least four members, three conserved positions and an intra-cluster similarity Z-score above 1.5. Manual inspection of these clusters suggested that the positions and types of the conserved residues in these EH clusters are much more diverse than those in the HE clusters. In different clusters, conserved residues of various amino acid types are found in different regions within the basic strand-loop-helix framework, while for a particular cluster the conserved positions can be concentrated in one or two regions. In the following, we focus on EH clusters that have conserved residues with polar sidechains in the C-terminal region of the strand or in the nearby loop. Compared with other clusters, these clusters or motifs are more likely to be associated with functions, because the active sites of intact (βα)8 barrel proteins are usually formed by residues in this region. In addition, the conserved residues forming an active site usually include those with polar side chains. The sequence logos and superimposed structures for seven such clusters are shown in Fig.4. In Supplementary Figure 2, we show as examples the sequence logos and superimposed structures of two clusters (clusters 2 and 93) whose conserved residues are either not located in the designated region or not of polar sidechain types. Such motifs may play important structural roles in (βα)8 barrel proteins, while it is difficult to discuss their functional roles based on their sequence conservation patterns.
Fig.4 Examples of EH clusters. The sequence logos are shown left and their superimposed structures are shown right. The cluster members can be found in Supplementary Table 3. The conserved sites discussed in the main text are shown in red with their sidechains displayed.
As shown in Fig.4, cluster EH64 corresponds to the well-known phosphate binding motif. The sequence logo of this cluster shows that, to varied extents, sequence conservation is exhibited throughout the entire fragment, with the most conserved residues being at the N- and C-terminal ends of the strand.
Cluster EH54 exhibits strong sequence conservation in the strand. An invariant Asp can be found at the C-terminal end of the strand. Inspection of the PDB structures of proteins containing the member fragments of EH54 revealed that this Asp takes part in the coordination of a divalent metal ion, such as iron, nickel or zinc.
Cluster EH97 has a conserved DXHXH sequence motif (‘X’ represents variable residue types) that spans the strand. In the corresponding proteins, the two histidine residues in this motif participate in divalent metal (namely, zinc or nickel) coordination, while the sidechain of the C-terminal Asp residue seems to play a structural role by forming a hydrogen bond with the backbone of a neighboring strand. Another conserved feature of fragments in this cluster is the glycine at the C-terminal end of the α-helix.
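A sequence pattern of this kind can be scanned for with a simple regular expression. The sketch below is only an illustration: the helper names and the toy sequence are hypothetical, and the search uses sequence alone, whereas the clusters in this work were defined with the help of structure alignment.

```python
import re


def motif_to_regex(motif):
    """Translate a motif such as 'DXHXH' into a regex string,
    with 'X' matching any single residue letter."""
    return "".join("[A-Z]" if ch == "X" else re.escape(ch) for ch in motif)


def find_motif(sequence, motif):
    """Return the 0-based start positions of all (possibly overlapping)
    occurrences of the motif in a one-letter amino acid sequence."""
    # A lookahead lets overlapping occurrences be reported as well.
    pattern = re.compile("(?=" + motif_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(sequence)]


# Hypothetical toy sequence, not taken from any protein discussed here.
toy_sequence = "MKVIDLHAHLGAS"
toy_hits = find_motif(toy_sequence, "DXHXH")
```

A hit by sequence alone does not imply metal binding; in the analysis above, the functional assignment came from inspecting ligands in the PDB structures of the containing proteins.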
Cluster EH72 has a conserved DXXH motif. The loop has the form of a short helix in this cluster. The conserved Asp is at the C-terminal end of the strand while the conserved His is at the beginning of the short helix. Again, these two residues take part in divalent metal ion (zinc or magnesium) binding in the containing proteins.
Most of the conserved positions for clusters EH35, EH79 and EH15 are located on the helix, especially at the C-terminal end of the helix. Such locations and the conserved residue types (for example, Ala and Gly at the C-terminal end of EH35) suggest structural roles for these conserved residues. However, each of these clusters shows a moderately conserved polar residue on the β-strand that might play functional roles. In cluster EH35, this is a conserved Asp at the C-terminal end of the strand, pointing towards the interior of the (βα)8 barrel. This Asp seems to play important but varied functional roles in different containing proteins. For example, it coordinates a magnesium ion in the protein structure 1F61, while it forms a hydrogen bond with a small molecule ligand in the protein structure 3BOF. In cluster EH79, an intermediately conserved histidine is also located at the C-terminal end of the strand. When present, it is also observed to interact directly with small molecule or metal ion ligands in the containing proteins. In cluster EH15, a moderately conserved histidine is found in the middle of the strand. Probably because it is deeper inside the barrel than the conserved His in EH79, this histidine is generally not observed to interact directly with ligands in the containing proteins. However, its side chain participates in hydrogen bonding networks with polar sidechains from neighboring strands, sometimes bridged by buried water molecules. Together these groups may form the polar bottoms of active site pockets, presumably providing favorable environments for the binding of polar ligands by the respective containing proteins.
3.4. The chimeric barrel A0 is soluble but not in a well-folded state
To test the idea of engineering artificial proteins using the recurring motifs discussed above as interfaces to integrate natural partial domains, a chimeric protein A0 was designed based on proteins 1YXY and 1THF. The combined new protein A0 is soluble when expressed at 16 ℃ with a C-terminal His-tag and showed one main peak when purified by gel filtration (Fig.6c). The CD spectrum shows that a significant portion of A0 forms secondary structures consistent with α/β proteins (Supplementary Figure 4). The unfolding curves follow a two-state model (Fig.7b-d). Together these suggest that the new barrel protein A0 is partially folded. However, A0 is slightly degraded after purification (Fig.6e), and gel filtration on a higher-efficiency column gives two main peaks (Fig.6a). In addition, differential scanning calorimetry (DSC) fails to present a sharp peak (Fig.7a). When A0 is inserted into the β-lactamase tripartite fusion system (Foit and Bardwell, 2013; Foit et al., 2009), the resulting ampicillin resistance is low (Fig.5a). These observations indicate that A0 is probably not in a well-folded state.
3.5. Directed evolution of A0 led to a well-folded (βα)8-barrel protein
Mutants of A0 with potentially improved stability and foldability were generated and selected on plates with increasing concentrations of ampicillin. They were subjected to successive rounds of directed evolution, with improvement indicated by increased ampicillin resistance (Fig.5 and Supplementary Table 5). After six rounds, AR703, one of the mutants that led to the highest ampicillin resistance, was picked for further characterization. The mutated amino acids contained in AR703 are shown on the A0 structure model (Fig.5h). R229C showed very high frequency in the earlier rounds, and it might lead to disulfide bond formation in the relatively oxidative environment of the periplasmic space where the β-lactamase functions. Thus AR703 was characterized in a reductive as well as in an oxidative environment (named AR703_red and AR703_ox, respectively) after purification with the Ni2+ column.
Fig. 5 Directed evolution of A0. (a) The effect of directed evolution was tested by spot titer experiment on a plate containing 0.4mg/ml ampicillin. A0 is the original protein, and AB009, AF10, AK102, AP408, AQ803 and AR703 are the best mutants from the first to the sixth rounds of directed evolution. (b-g) The sites mutated in each round are shown by sticks. The 1YXY part is shown in red and the sites of this part mutated in each round are shown in green. The 1THF part is shown in blue and the sites of this part mutated in each round are shown in yellow. The mutation sites mutated in the previous rounds are shown in cyan. (h) The AR703 mutant from the final round was picked for further characterization. The sites mutated in 1YXY part are I16N, L23P, Y29D, E31D, M36I, G47D, V56D, V216E, G217A, and R229C (shown in green sticks). The sites mutated in 1THF part are G74V, G75D, F131Y, I166N, and P185S (shown in yellow sticks).
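The mutation lists in panel (h) can be tallied per parent fragment with a short script. This is a minimal sketch; the mutation labels are copied from the caption above, while the names `parse_mutation` and `ar703_mutations` are illustrative.

```python
import re

# Point mutation label: wild-type residue, position, new residue (e.g. R229C).
MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_mutation(label):
    """Split a label such as 'R229C' into ('R', 229, 'C')."""
    match = MUTATION.match(label)
    if match is None:
        raise ValueError("not a point mutation label: " + label)
    wild_type, position, mutant = match.groups()
    return wild_type, int(position), mutant


# Mutations of AR703 as listed in the caption of Fig. 5(h).
ar703_mutations = {
    "1YXY": "I16N L23P Y29D E31D M36I G47D V56D V216E G217A R229C".split(),
    "1THF": "G74V G75D F131Y I166N P185S".split(),
}

# Parse every label (this also validates the format) and count per part.
parsed = {part: [parse_mutation(m) for m in labels]
          for part, labels in ar703_mutations.items()}
counts = {part: len(muts) for part, muts in parsed.items()}
```

The tally gives 10 mutations in the 1YXY part and 5 in the 1THF part of AR703.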
Fig.6 The gel filtration, solubility and proteinase resistance results. (a) The gel filtration profile of the AR703 mutant was significantly improved, turning into a single sharp peak in comparison with the original protein A0 (both in the reduced state). (b) The oxidized AR703 shows a more compact state according to its larger elution volume, implying the folding back of the α helix containing the R229C mutation. (c) The original A0 shows only one wide main peak during purification, which may be due to the lower efficiency of the column. (d) Positions of residues R229 and C20 (green sticks) in a structural model generated by simply merging the two half structures from 1YXY (red) and 1THF (blue), respectively. (e) When expressed at 37 ℃, AR703 was totally soluble while the original A0 was not (‘w’ stands for whole cell and ‘s’ stands for supernatant). When treated with proteinase K for 5 min, AR703 was less digested (‘0 min’ stands for no treatment with proteinase K). AR703 was also less spontaneously degraded during expression and purification.
(1) AR703 has been significantly improved into a well-folded protein
This is supported by a range of evidence. Firstly, the solubility and resistance to proteolysis are clearly improved. The original A0 is not soluble when expressed at 37 ℃, but AR703 is soluble under the same condition (Fig.6e). Although A0 is soluble when expressed at 16 ℃, it was insoluble when cloned into a modified pET28a vector with His-tags at both termini to prevent the slight degradation. In contrast, AR703 is again soluble when expressed from a pET28a vector at 37 ℃ (data not shown). According to the SDS-PAGE results, AR703 is less degraded during expression or purification and is more resistant to protease treatment, indicating that it is structurally more stable (Fig.6e).
Secondly, the dispersion state of AR703 in solution is improved relative to A0 according to gel filtration. The gel filtration profile of A0 has two peaks (Fig.6a), although it showed only one peak during purification on a Superdex200 16/60 column (Fig.6c), probably due to the higher efficiency of the analytical column. One of the parent proteins of A0, 1YXY, is in a domain-swapped dimeric state. A0 has inherited the partial barrel that participates in this interaction, and similar domain-swapped interactions might account for the apparently aggregated state (Fig.6d), although we cannot yet provide experimental evidence for this hypothesis. In contrast, the evolved AR703 shows only one sharp peak, indicating an improved molecular dispersion state.
Fig.7 The thermal and chemical unfolding curves. (a) The DSC results show the significant improvement of the AR703 mutant and the further improvement upon formation of the disulfide bond. (b) The DSF results are similar to the DSC results. (c) The mean residue ellipticity (MRE) at 222 nm was collected after equilibration with different concentrations of GuHCl. (d) The unfolding curve was also obtained by collecting tryptophan fluorescence at 334 nm with excitation at 280 nm. The GuHCl equilibrium unfolding also shows the increased stability of the AR703 mutant.
Thirdly, the thermostability of AR703 has been significantly improved. The melting temperatures of A0 and AR703_red determined by DSC are 45.8 ℃ and 58.1 ℃, respectively. AR703_ox is even more thermostable, with a melting temperature of 66.7 ℃ (Fig.7a). The DSF results are in good agreement with the DSC results, with melting temperatures of 50.5 ℃, 63.7 ℃ and 72.0 ℃ for A0, AR703_red and AR703_ox, respectively (Fig.7b).
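The size of these stabilizations is easiest to compare as shifts relative to A0. The short calculation below uses only the melting temperatures quoted above; the variable and function names are illustrative.

```python
# Melting temperatures (degrees Celsius) as reported in the text.
melting_temperatures = {
    "DSC": {"A0": 45.8, "AR703_red": 58.1, "AR703_ox": 66.7},
    "DSF": {"A0": 50.5, "AR703_red": 63.7, "AR703_ox": 72.0},
}


def shifts_relative_to_a0(values):
    """Return the Tm increase of each AR703 form relative to A0."""
    return {name: round(tm - values["A0"], 1)
            for name, tm in values.items() if name != "A0"}


tm_shifts = {method: shifts_relative_to_a0(values)
             for method, values in melting_temperatures.items()}
```

By DSC the increases over A0 are 12.3 ℃ (AR703_red) and 20.9 ℃ (AR703_ox); by DSF they are 13.2 ℃ and 21.5 ℃. In both methods the oxidized form gains a further 8 to 9 ℃ over the reduced form.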
Finally, the resistance of folded AR703 to denaturants is also increased compared with A0. According to the unfolding curves detected by circular dichroism, AR703 is much more stable than the original A0, and AR703_ox is more stable than the reduced AR703_red (Fig.7c). The unfolding curves detected by fluorescence give similar results (Fig.7d).
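A two-state description of denaturant unfolding curves can be made concrete with the standard linear extrapolation model, in which the unfolding free energy decreases linearly with denaturant concentration. The sketch below is an illustrative assumption about the functional form, not the fitting procedure actually used for Fig.7; the parameter values are hypothetical and the function names are illustrative.

```python
import math

R_KJ = 8.314462618e-3  # gas constant in kJ/(mol*K)


def fraction_unfolded(denaturant, dg_water, m_value, temperature=298.15):
    """Two-state fraction unfolded under the linear extrapolation model.

    dG([D]) = dg_water - m_value * [D], with K_unfold = exp(-dG / RT)
    and fraction unfolded = K_unfold / (1 + K_unfold).
    """
    dg = dg_water - m_value * denaturant
    return 1.0 / (1.0 + math.exp(dg / (R_KJ * temperature)))


def observed_signal(denaturant, folded, unfolded, dg_water, m_value):
    """Population-weighted signal, e.g. ellipticity at 222 nm."""
    fu = fraction_unfolded(denaturant, dg_water, m_value)
    return (1.0 - fu) * folded + fu * unfolded


def midpoint(dg_water, m_value):
    """Denaturant concentration at which half of the protein is unfolded."""
    return dg_water / m_value


# Hypothetical parameters (kJ/mol and kJ/mol/M), not fitted to Fig.7.
example_cm = midpoint(dg_water=20.0, m_value=10.0)
```

In such a model a more stable protein shows up as a larger `dg_water` and hence, for similar `m_value`, a transition midpoint at higher GuHCl concentration.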
Possible causes of the significant differences between AR703_ox and AR703_red are worth further discussion. The only two cysteine residues available to form an intra-chain disulfide bond are R229C and C20. If we build a structural model of the chimera simply by putting together the structures of the parent proteins, R229C would be too far from C20 to form a disulfide bond. This is because in the structure of the parent protein 1YXY, the C-terminal helix containing R229 extends away from the main body of the protein and participates in domain-swapped interactions with another peptide chain (see Fig.6d). As AR703 is a monomer, that helix should no longer be interacting with another peptide chain in this evolved chimeric protein. On the other hand, the existence of a disulfide bond under the oxidative condition is strongly supported by the significant differences between AR703_ox and AR703_red. For the disulfide bond to form, R229C must be brought close to C20. This could be made possible if, relative to its position in the parental 1YXY structure, the helix containing R229C folds back towards the main body of the protein. The suggested change would make the protein structure more compact. In support of this, the apparent molecular size of AR703_ox determined by gel filtration is smaller than that of AR703_red (Fig.6b). These data strongly indicate that AR703, especially its non-reduced form, forms a well-folded globular structure, although we cannot provide an atomic model of it because efforts to obtain crystals of AR703 have been unsuccessful so far.
(2) The partial barrel modules are of different intrinsic stabilities
The mutations encountered in the directed evolution rounds are listed in Supplementary Table 5. Individual mutations cannot strictly be interpreted as always being beneficial, because neutral or mildly harmful mutations could have accumulated during directed evolution. Nevertheless, the collection of mutations at each of the six rounds of directed evolution can be considered adaptive as a whole. Because of this, and because relatively few mutations were added at each round, the majority of the finally accumulated mutations should have positive effects.
In terms of sequence positions, the mutations were mainly found within the two partial barrels instead of within the super-secondary structure motifs at the interfaces. There is a clear asymmetry in the distribution of mutations between the two partial barrels, especially in the earlier rounds: selected mutations are significantly concentrated on the partial barrel inherited from 1YXY. This indicates that the starting partial barrel from 1THF is by itself quite stable and difficult to improve, with most random mutations tending to be disruptive. To further look into the folding status of this partial barrel in the starting and the evolved chimeric proteins, the fluorescence of the single tryptophan in the chimeric proteins was measured. This tryptophan is located well inside the core of the 1THF partial barrel. The maximum emission wavelengths of A0, AR703_red and AR703_ox are exactly the same (Supplementary Figure 3), suggesting similar environments of the tryptophan (in the 1THF part) in these different proteins or protein forms. This supports the view that the 1THF partial barrel was well-folded in the designed chimeric protein, even though this part is only a partial and not an intact structural domain of the parent protein. This observation suggests that engineering new proteins from partial domains is a viable approach.
It is interesting to note that R229C was selected with very high frequency in the first round of selection (it has been confirmed that none of the nine colonies randomly picked before the selection contains the R229C mutation), and that the final AR703_ox indeed folds much better than AR703_red. A possible explanation is that R229C leads to an intra-molecular disulfide bond with the only other cysteine (C20), giving a compact structure that is less likely to aggregate. This is supported by the smaller molecular size of AR703_ox in comparison with AR703_red observed in gel filtration (Fig.6b). In addition, this disulfide bond contributes significantly to the stability according to the DSC and DSF results (Fig.7a,b). This may imply an important function for newly formed intra-molecular disulfide bonds during protein evolution: to turn a loosely packed structure into a compact one with as few as a single mutation. Subsequent evolution may further optimize the packing.
Although the folded tertiary structure should be much better formed and much more stable in AR703 than in the starting chimeric A0, the secondary structure compositions of both proteins may still be quite similar to those of the parent native proteins. The circular dichroism spectra of A0 and the different forms of AR703 are only slightly different (Supplementary Figure 4). It is not unexpected that A0 has inherited most of the secondary structures of the parent proteins, and that the directed evolution has mainly optimized the packing between secondary structures.
4. Conclusion
Engineering new proteins using partial domains from natural proteins as building blocks is an attractive approach for protein engineering as well as for deciphering protein evolution. Although various bioinformatics approaches have been proposed to segment natural proteins into substructures for various purposes, only a small number of structural or bioinformatics analysis strategies have been proposed to suggest building parts and assembling strategies from a protein engineering perspective (Bauer et al., 2006; Endelman et al., 2004; Pantazes et al., 2007; Silberg et al., 2004; Voigt et al., 2002).
In this work, we used a structure alignment-based sequence clustering approach to systematically analyze the αβ and βα motifs in TIM barrel proteins. A number of recurring sequence-structure motifs have been identified. Besides looking for their implications in sequence-structure-function relationships of TIM barrels, we aimed at exploiting these motifs for protein engineering. Here we explored an engineering strategy of using a recurring motif as a conserved interface to fit two partial barrels together. The initial chimeric protein A0 designed by this strategy is soluble but not well-folded. However, directed evolution using a β-lactamase tripartite fusion system was able to significantly optimize the folding. Eventually, after six rounds of directed evolution, a well-folded mutant, AR703, was obtained. Our results suggest that with reasonable optimization efforts using directed evolution, novel well-folded chimeric proteins can feasibly be obtained based on systematic bioinformatics analysis of natural proteins. On the other hand, the fact that the initially designed A0 is a relatively poor folder suggests that new analysis methods and design strategies may still need to be explored to increase the design success rate and to reduce the burden on experimental optimization.
Acknowledgements
We would like to thank Dr. James Bardwell for the plasmid TEM1-β-lactamase, Yanwei Ding for help with the DSC experiments, and Zhenhua Shao, Kai Yang, Zexian Liu, Wei Yan, Jian Zhan and Wei Zhao for help and discussions. J.W. thanks Deqing Ma for the encouragement and wishes her a happy wedding. This work has been supported by grants from the National Natural Science Foundation of China (31200546, 31470717 to Q.C. and 31370755 to H.L.).
References
Bauer, D.C., Boden, M., Thier, R., and Gillam, E.M. (2006). STAR: predicting recombination sites from amino acid sequence. BMC Bioinformatics 7.
Blaber, M., and Lee, J. (2012). Designing proteins from simple motifs: opportunities in Top-Down Symmetric Deconstruction. Curr Opin Struct Biol 22, 442-450.
Brändén, C.-I. (1991). The TIM barrel - the most frequently occurring folding motif in proteins. Curr Opin Struct Biol 1, 978-983.
Broom, A., Doxey, A.C., Lobsanov, Y.D., Berthin, L.G., Rose, D.R., Howell, P.L., McConkey, B.J., and Meiering, E.M. (2012). Modular evolution and the origins of symmetry: reconstruction of a three-fold symmetric globular protein. Structure 20, 161-171.
Cadwell, R.C., and Joyce, G.F. (1994). Mutagenic PCR. PCR Methods Appl 3, S136-140.
Cirino, P.C., Mayer, K.M., and Umeno, D. (2003). Generating mutant libraries using error-prone PCR. Methods Mol Biol 231, 3-9.
Dahiyat, B.I., and Mayo, S.L. (1997). De novo protein design: fully automated sequence selection. Science 278, 82-87.
Eisenbeis, S., Proffitt, W., Coles, M., Truffault, V., Shanmugaratnam, S., Meiler, J., and Hocker, B. (2012). Potential of fragment recombination for rational design of proteins. J Am Chem Soc 134, 4019-4022.
Endelman, J.B., Silberg, J.J., Wang, Z.G., and Arnold, F.H. (2004). Site-directed protein recombination as a shortest-path problem. Protein Engineering Design & Selection 17, 589-594.
Farias-Rico, J.A., Schmidt, S., and Hocker, B. (2014). Evolutionary relationship of two ancient protein superfolds. Nat Chem Biol 10, 710-715.
Foit, L., and Bardwell, J.C. (2013). A tripartite fusion system for the selection of protein variants with increased stability in vivo. Methods Mol Biol 978, 1-20.
Foit, L., Morgan, G.J., Kern, M.J., Steimer, L.R., von Hacht, A.A., Titchmarsh, J., Warriner, S.L., Radford, S.E., and Bardwell, J.C. (2009). Optimizing protein stability in vivo. Mol Cell 36, 861-871.
Fortenberry, C., Bowman, E.A., Proffitt, W., Dorr, B., Combs, S., Harp, J., Mizoue, L., and Meiler, J. (2011). Exploring symmetry as an avenue to the computational design of large protein domains. J Am Chem Soc 133, 18026-18029.
Frey, B.J., and Dueck, D. (2007). Clustering by passing messages between data points. Science 315, 972-976.
Heinig, M., and Frishman, D. (2004). STRIDE: a web server for secondary structure assignment from known atomic coordinates of proteins. Nucleic Acids Res 32, W500-502.
Hocker, B., Beismann-Driemeyer, S., Hettwer, S., Lustig, A., and Sterner, R. (2001). Dissection of a (betaalpha)8-barrel enzyme into two folded halves. Nature structural biology 8, 32-36.
Huang, P.S., Feldmeier, K., Parmeggiani, F., Fernandez Velasco, D.A., Hocker, B., and Baker, D. (2016). De novo design of a four-fold symmetric TIM-barrel protein with atomic-level accuracy. Nat Chem Biol 12, 29-34.
Jacobs, T.M., Williams, B., Williams, T., Xu, X., Eletsky, A., Federizon, J.F., Szyperski, T., and Kuhlman, B. (2016). Design of structurally distinct proteins using strategies inspired by evolution. Science 352, 687-690.
Koga, N., Tatsumi-Koga, R., Liu, G., Xiao, R., Acton, T.B., Montelione, G.T., and Baker, D. (2012). Principles for designing ideal protein structures. Nature 491, 222-227.
Kuhlman, B., Dantas, G., Ireton, G.C., Varani, G., Stoddard, B.L., and Baker, D. (2003). Design of a novel globular protein fold with atomic-level accuracy. Science 302, 1364-1368.
Lang, D., Thoma, R., Henn-Sax, M., Sterner, R., and Wilmanns, M. (2000). Structural evidence for evolution of the beta/alpha barrel scaffold by gene duplication and fusion. Science 289, 1546-1550.
Li, Z., Yang, Y., Zhan, J., Dai, L., and Zhou, Y. (2013). Energy functions in de novo protein design: current challenges and future prospects. Annual review of biophysics 42, 315-335.
Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247, 536-540.
Nagano, N., Orengo, C.A., and Thornton, J.M. (2002). One fold with many functions: the evolutionary relationships between TIM barrel families based on their sequences, structures and functions. J Mol Biol 321, 741-765.
Nikkhah, M., Jawad-Alami, Z., Demydchuk, M., Ribbons, D., and Paoli, M. (2006). Engineering of beta-propeller protein scaffolds by multiple gene duplication and fusion of an idealized WD repeat. Biomol Eng 23, 185-194.
Pantazes, R.J., Saraf, M.C., and Maranas, C.D. (2007). Optimal protein library design using recombination or point mutations based on sequence-based scoring functions. Protein Engineering Design & Selection 20, 361-373.
Richter, M., Bosnali, M., Carstensen, L., Seitz, T., Durchschlag, H., Blanquart, S., Merkl, R., and Sterner, R. (2010). Computational and experimental evidence for the evolution of a (beta alpha)8-barrel protein from an ancestral quarter-barrel stabilised by disulfide bonds. J Mol Biol 398, 763-773.
Saab-Rincon, G., Olvera, L., Olvera, M., Rudino-Pinera, E., Benites, E., Soberon, X., and Morett, E. (2012). Evolutionary walk between (beta/alpha)(8) barrels: catalytic migration from triosephosphate isomerase to thiamin phosphate synthase. J Mol Biol 416, 255-270.
Silberg, J.J., Endelman, J.B., and Arnold, F.H. (2004). SCHEMA-guided protein recombination. Protein engineering 388, 35-42.
Soberon, X., Fuentes-Gallego, P., and Saab-Rincon, G. (2004). In vivo fragment complementation of a (beta/alpha)(8) barrel protein: generation of variability by recombination. FEBS Lett 560, 167-172.
Soding, J., and Lupas, A.N. (2003). More than the sum of their parts: on the evolution of proteins from peptides. Bioessays 25, 837-846.
Sullivan, B.J., Durani, V., and Magliery, T.J. (2011). Triosephosphate isomerase by consensus design: dramatic differences in physical properties and activity of related variants. J Mol Biol 413, 195-208.
Sussman, J.L., Lin, D., Jiang, J., Manning, N.O., Prilusky, J., Ritter, O., and Abola, E.E. (1998). Protein Data Bank (PDB): database of three-dimensional structural information of biological macromolecules. Acta crystallographica Section D, Biological crystallography 54, 1078-1084.
Toh, H. (1997). Introduction of a distance cut-off into structural alignment by the double dynamic programming algorithm. Comput Appl Biosci 13, 387-396.
Tomii, K., Sawada, Y., and Honda, S. (2012). Convergent evolution in structural elements of proteins investigated using cross profile analysis. BMC Bioinformatics 13, 11.
Voigt, C.A., Martinez, C., Wang, Z.G., Mayo, S.L., and Arnold, F.H. (2002). Protein building blocks preserved by recombination. Nature structural biology 9, 553-558.
Watanabe, H., Yamasaki, K., and Honda, S. (2014). Tracing primordial protein evolution through structurally guided stepwise segment elongation. J Biol Chem 289, 3394-3404.
Xiong, P., Wang, M., Zhou, X., Zhang, T., Zhang, J., Chen, Q., and Liu, H. (2014). Protein design with a comprehensive statistical energy function and boosted by experimental selection for foldability. Nat Commun 5, 5330.
Yadid, I., and Tawfik, D.S. (2007). Reconstruction of functional beta-propeller lectins via homo-oligomeric assembly of shorter fragments. J Mol Biol 365, 10-17.
Zhang, Y., and Skolnick, J. (2005). TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res 33, 2302-2309.
Conflict of Interest: JA and JAS hold patent (CA2757917A1) rights to some applications of TPP-IOA (which are not supported by these data!).
Highlights
The αβ and βα modules of TIM barrel proteins were clustered based on structure and sequence.
A number of recurring motifs have been identified.
A chimeric protein A0 was created by using the recurring motifs as interfaces.
A0 was significantly improved by six rounds of directed evolution.