Identifying protein domains by global analysis of soluble fragment data

Analytical Biochemistry 465 (2014) 53–62 Contents lists available at ScienceDirect Analytical Biochemistry journal homepage: www.elsevier.com/locate...

Analytical Biochemistry journal homepage: www.elsevier.com/locate/yabio

Esther M.M. Bulloch ⇑, Richard L. Kingston

School of Biological Sciences, University of Auckland, New Zealand

Article info

Article history:
Received 17 February 2014
Received in revised form 17 June 2014
Accepted 25 June 2014
Available online 10 July 2014

Keywords:
Protein expression
Protein domains
Gene fragmentation
Solubility screen
Domain mapping
Cluster analysis

Abstract

The production and analysis of individual structural domains is a common strategy for studying large or complex proteins, which may be experimentally intractable in their full-length form. However, identifying domain boundaries is challenging if there is little structural information concerning the protein target. One experimental procedure for mapping domains is to screen a library of random protein fragments for solubility, since truncation of a domain will typically expose hydrophobic groups, leading to poor fragment solubility. We have coupled fragment solubility screening with global data analysis to develop an effective method for identifying structural domains within a protein. A gene fragment library is generated using mechanical shearing, or by uracil doping of the gene and a uracil-specific enzymatic digest. A split green fluorescent protein (GFP) assay is used to screen the corresponding protein fragments for solubility when expressed in Escherichia coli. The soluble fragment data are then analyzed using two complementary approaches. Fragmentation "hotspots" indicate possible interdomain regions. Clustering algorithms are used to group related fragments, and concomitantly predict domain location. The effectiveness of this Domain Seeking procedure is demonstrated by application to the well-characterized human protein p85α.

© 2014 Elsevier Inc. All rights reserved.

⇑ Corresponding author. Address: School of Biological Sciences, University of Auckland, Private Bag 92019, Auckland 1142, New Zealand. E-mail address: [email protected] (E.M.M. Bulloch).
http://dx.doi.org/10.1016/j.ab.2014.06.021
0003-2697/© 2014 Elsevier Inc. All rights reserved.

¹ Abbreviations used: BCR, breakpoint cluster region-homology; bp, base pairs; DEAE, diethylaminoethyl; EDTA, ethylenediaminetetraacetic acid; GFP, green fluorescent protein; IPTG, isopropyl β-D-1-thiogalactopyranoside; LB, Lysogeny Broth; MBP, maltose binding protein; PCR, polymerase chain reaction; Pfu, Pyrococcus furiosus; SH, src-homology.

A protocol for locating protein domains / E.M.M. Bulloch, R.L. Kingston / Anal. Biochem. 465 (2014) 53–62

Biochemical, biophysical, and structural analysis of proteins requires significant amounts of material. Despite the continual development of heterologous expression systems [1], obtaining sufficient quantities of a protein in a correctly folded and soluble form is often difficult, particularly for complex eukaryotic proteins. Fortunately, many large proteins are modular in nature and composed of multiple structural domains: semiautonomous regions of the polypeptide that have the capacity to fold in isolation. The individual domains may be easier to express and purify than the full-length protein, and their characterization can provide critical insights into protein function.

The challenge is to identify the boundaries of these structural domains. If trace amounts of a full-length protein can be isolated, limited enzymatic proteolysis is a useful and well-validated experimental technique for identifying domain boundaries [2]. Alternatively, domain boundaries can be inferred from the protein sequence, using various bioinformatic approaches (see, e.g., [3]). Based on such in silico analyses, the expression of multiple constructs with slightly differing termini is typically evaluated [4,5]. This approach is embedded in the workflows of many structural genomics consortia (see, e.g., [3,6]). While these methods are unquestionably successful, there remain situations where they are difficult to apply. Some proteins cannot be expressed in full-length form, even in trace amounts, or have very limited sequence similarity with previously characterized proteins, weakening the structural inferences that can be made.

Over the past decade some alternative experimental approaches for identifying structural domain boundaries have been developed. Although the exact methodology varies greatly, the basic strategy is to express a random library of protein fragments and screen these for solubility in a high-throughput manner [7–11]. This approach is successful because fragmentation within a structural domain will generally expose hydrophobic amino acids sequestered in the domain interior, giving rise to conformationally unstable fragments with limited solubility. Studying the expression, stability, and solubility of fragments can therefore yield information about the structural domains embedded within a complex protein.

Methods used to fragment the target gene include limited exonuclease and/or endonuclease digest [12–15], mechanical shearing [14,16], PCR¹ with random primers [17], and uracil-doped PCR followed by a uracil-specific enzymatic digest [10,18,19]. Each of these fragmentation techniques has some degree of sequence and/or positional bias, but often the effect on the fragment library composition can be minimized by careful experimental design. Depending on the methods of fragmentation and cloning used, the probability of a gene fragment being in the correct open reading frame in the subsequent expression and solubility assay varies from 1/3 to 1/18. Potential bias in the protein fragments screened due to open reading frame selection can be reduced by using a mixture of nine different frame-shift vectors [18]. In addition, methods have been developed to select only fragments that are in the correct open reading frame prior to solubility screening [14,20,21], substantially increasing the efficiency of the screening process.

There are also several methods for medium- to high-throughput solubility screening of protein fragments when expressed in Escherichia coli. Fluorescence screens have been developed based on fusing fragments to GFP. Of particular note is the split-GFP assay, in which fragments are expressed fused to the last strand of the β-barrel of GFP (GFP11) [22,23]. If the fragment is soluble then GFP11 is available to bind to the nonfluorescent remainder of GFP (GFP1–10) when it is subsequently introduced, reconstituting GFP fluorescence. In the CoFi (colony filtration) method, fragments are expressed with a short tag, cells are lysed, and soluble proteins are transferred to a membrane that is immunochemically probed for tagged protein [13,15]. The ESPRIT (expression of soluble proteins by random incremental truncation) method is similar, but fragments are expressed with tags at both ends for immunochemical probing and clones are validated with high-throughput expression and purification trials [12,21]. Life/death colony assays for solubility have also been developed, in which fragments are fused to proteins that confer antibiotic resistance [19].
However, a disadvantage of any approach where fragments are fused to a large protein is that the fusion partner may exert a significant carrier effect on otherwise insoluble fragments, leading to false positive results.

Although the published methodologies have been used successfully to map protein domains [7,19], these procedures have not been widely adopted. This may be due to the labor-intensive nature of some screens or the need for robotics to carry out high-throughput screening of clones. If more widely implemented, these random fragmentation and screening techniques have the potential to allow many new proteins to be studied at a molecular level.

In this study we investigated whether simple, medium-throughput, and low-cost fragmentation, screening, and analysis protocols could be combined to identify domain boundaries. We adapted the split-GFP solubility assay, developed by Waldo and co-workers [22,23], to screen gene fragment libraries created with two different methods (Fig. 1). In contrast to previous studies, we did not focus on directly optimizing fragment solubility through a multistep screening process. Instead we globally analyzed the fragment solubility data to infer the domain boundaries. Here we illustrate the effectiveness of this Domain Seeking methodology by applying it to a structurally characterized protein, human p85α [24–33].

Materials and methods

Reagents

The vector pBAD-MCS was supplied by the Protein Purification and Expression Facility at the European Molecular Biology Laboratory, Heidelberg. The pET_GFP1–10 plasmid for the split-GFP assay was a kind gift from Geoff Waldo at the Los Alamos National Laboratory (USA). DNA primers were obtained from Integrated DNA Technologies. The gene for human p85α, codon optimized for expression in Escherichia coli, was synthesized by GeneArt (Germany) and supplied in the vector pMK. DNA was extracted/purified using Nucleospin Gel and PCR clean-up kits (Macherey-Nagel, Germany). Chemical reagents were from Sigma (USA), Life Technologies (USA), or Pure Science (NZ).

Construction of the pBAD_GFP11_T7LysH17A vector for the split-GFP assay

A 165 base pair (bp) DNA cassette (Fig. 2) for expression of protein fragments with an N-terminal His6-tag and a C-terminal GFP11 tag, and restriction enzyme sites for sticky-end (SpeI and XhoI) or blunt-end (PvuII) ligation, was created by overlap extension PCR. An existing XhoI site was removed from the multiple cloning site of the pBAD-MCS vector using QuikChange mutagenesis (Stratagene) with the mutagenic primer 5′-GCTTGCGGCCGCACTCGTGAGCTTGGCTGTTTTGG-3′ and its complement. The DNA cassette was then inserted between the NcoI and the HindIII sites of the mutated pBAD-MCS vector to generate the pBAD_GFP11 vector.

The pBAD_GFP11 vector was further modified to express basal levels of the T7 polymerase inhibitor, T7 lysozyme. A 636 bp fragment of the plasmid pLysS [34], encompassing the gene for T7 lysozyme, was amplified with 5′ and 3′ SphI restriction enzyme sites using PCR with the primers 5′-CTGTGCATGCGGCCCATTGGCTGCCTC-3′ and 5′-CGGCGTAGAGCATGCGGGTCCCCTTTGATAGATTAA-3′. This fragment was then ligated into a SphI site in a nonessential region of the pBAD_GFP11 vector to create pBAD_GFP11_T7Lys (Supplementary Fig. 1).

Finally, the mutation H17A was introduced into the T7 lysozyme gene to reduce its amidase activity, using a two-step megaprimer-based site-directed mutagenesis method [35]. In brief, a 557 bp megaprimer was amplified by PCR using the H17A mutagenic primer 5′-GACGCAATCTTTGTTGCCTGCTCGGCTACCAGG-3′ and a primer flanking the T7 lysozyme gene, 5′-GGCCCATTGGCTGCCTC-3′. The megaprimer was then employed for standard QuikChange mutagenesis, generating the plasmid pBAD_GFP11_T7LysH17A used for screening of p85α fragment libraries. This vector is available from Addgene (plasmid 59591).
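
The primer design described above can be sanity-checked with a few lines of code. The following is a minimal sketch, not part of the published protocol: the primer sequences are copied from the text, the recognition sequences for SphI (GCATGC) and XhoI (CTCGAG) are standard values that the article does not state, and the helper name `find_site` is hypothetical.

```python
# Standard recognition sequences (5'->3'); these values are assumed from
# general knowledge and are not given in the article itself.
SITES = {"SphI": "GCATGC", "XhoI": "CTCGAG"}

# Primer sequences copied from the vector construction text above.
SPHI_FORWARD = "CTGTGCATGCGGCCCATTGGCTGCCTC"
SPHI_REVERSE = "CGGCGTAGAGCATGCGGGTCCCCTTTGATAGATTAA"
XHOI_KNOCKOUT = "GCTTGCGGCCGCACTCGTGAGCTTGGCTGTTTTGG"


def find_site(primer: str, site: str) -> int:
    """Return the 0-based index of the first occurrence of `site`, or -1."""
    cleaned = primer.upper().replace(" ", "")
    return cleaned.find(site)


if __name__ == "__main__":
    # Both T7 lysozyme amplification primers should carry an SphI site.
    for label, primer in (("forward", SPHI_FORWARD), ("reverse", SPHI_REVERSE)):
        print(label, "SphI site at index", find_site(primer, SITES["SphI"]))
    # The mutagenic primer should no longer contain an intact XhoI site.
    print("XhoI site present in knockout primer:",
          find_site(XHOI_KNOCKOUT, SITES["XhoI"]) != -1)
```

A check of this kind only confirms that the intended sites are present or absent in the primers; it says nothing about annealing behavior or other sites elsewhere in the vector.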
Testing functionality of the pBAD_GFP11 vector series

The pBAD_GFP11 vector and its variants were tested in the split-GFP solubility assay using maltose binding protein (MBP) and an insoluble truncated form of MBP consisting of residues 1–183 (MBP1–183). The genes for MBP and MBP1–183 were amplified by PCR and directionally inserted into pBAD_GFP11 or related variants using the available SpeI and XhoI sites. The expression vectors were then transformed into BL21(DE3) Gold cells carrying pET_GFP1–10, +/− pLysS, before conducting the split-GFP solubility assay. For all experiments employing plasmid pLysS the media were supplemented with 10 μg/ml chloramphenicol to maintain positive selection for the plasmid.

Split-GFP solubility assay

The protocol for the in vivo split-GFP solubility screen was similar to that originally described by Waldo and co-workers [22,23]. Libraries of pBAD_GFP11_T7LysH17A vectors carrying fragments of the target gene, created as described below, were transformed into BL21(DE3) Gold/pET_GFP1–10 cells by electroporation. Cells were plated on prewetted, supported nitrocellulose membranes (Pall Corporation, USA) placed on 12 × 12 cm square LB agar plates supplemented with 100 μg/ml ampicillin and 50 μg/ml kanamycin. Plates were incubated for 15 h at 37 °C. To achieve an appropriate colony density (500 colonies per plate) various dilutions of the transformed cells were plated on the first day. The remainder of the transformed cells were stored at 4 °C overnight and plated out on the second day at the appropriate dilution.


Fig. 1. Principal steps of Domain Seeking. (I) Gene fragmentation is achieved either by mechanical shearing with nebulization or using the uracil-dependent approach illustrated [18]. In the latter approach, the target gene is amplified using dUTP-doped PCR to randomly incorporate uracil at thymidine sites. An enzymatic digest is then carried out with a cocktail of uracil-DNA glycosylase, endonuclease IV, and S1 nuclease to excise the uracil bases and create a double-stranded break at the abasic sites. (II) A split-GFP assay based on that developed by Waldo and co-workers is used for the protein fragment solubility screen [22,23]. Gene fragments are ligated into the vector pBAD_GFP11_T7LysH17A and this vector library is used to transform BL21(DE3) Gold/pET_GFP1–10 cells. Colonies are grown overnight on nitrocellulose membrane. The next day the solubility screen is carried out by inducing expression of GFP11-tagged fragments with arabinose, followed by a rest phase on noninducing media, which allows turnover of insoluble or unfolded fragments. Finally, expression of GFP1–10 is induced. Fluorescent colonies are immediately picked and the extracted plasmids are DNA sequenced. (III) Soluble fragments are mapped to the target protein and analyzed on a global basis to demarcate domain boundaries. Fragment start and end point frequencies are analyzed as a potential marker of interdomain regions. Clustering algorithms are used to group related fragments and determine prototypes for each group that reflect the underlying domains.
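
The start and end point frequency analysis mentioned in step III can be illustrated with a short sketch. This is only one plausible way to tabulate the data and is not taken from the authors' code: the window size, the count threshold, the function names, and the example fragments are all assumptions made for illustration. Fragments are given as 1-based, inclusive (start, end) residue positions.

```python
def boundary_counts(fragments, protein_length, window=10):
    """Count fragment start and end points in consecutive residue windows.

    `fragments` holds (start, end) residue numbers, 1-based and inclusive.
    Returns two lists (start counts, end counts), one entry per window.
    """
    n_bins = (protein_length + window - 1) // window
    starts = [0] * n_bins
    ends = [0] * n_bins
    for start, end in fragments:
        starts[(start - 1) // window] += 1
        ends[(end - 1) // window] += 1
    return starts, ends


def hotspots(counts, window, min_count):
    """Return the residue ranges of windows holding at least `min_count` boundaries."""
    return [
        (i * window + 1, (i + 1) * window)
        for i, count in enumerate(counts)
        if count >= min_count
    ]


if __name__ == "__main__":
    # Hypothetical soluble fragments on a hypothetical 320-residue protein.
    example = [(1, 84), (3, 86), (115, 300), (118, 310), (5, 80)]
    start_counts, end_counts = boundary_counts(example, protein_length=320)
    print("start hotspots:", hotspots(start_counts, 10, min_count=2))
    print("end hotspots:", hotspots(end_counts, 10, min_count=2))
```

Windows where many fragments begin or end are candidate interdomain regions under this reading; a real analysis would also need to account for any positional bias in the fragment library itself.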

Fig. 2. Cloning site of pBAD_GFP11 for fragment solubility screening with the split-GFP assay. Gene fragments are blunt-end ligated into the PvuII site for expression with an N-terminal His6-tag and a C-terminal GFP11 tag, under the control of an arabinose-inducible pBAD promoter [43].
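
The clustering step of the workflow (step III of Fig. 1) groups soluble fragments and derives a prototype interval for each group. The sketch below shows the general idea using the textbook Fuzzy C-means update rules with a squared Euclidean distance between (start, end) intervals, repeated from random starting points. It is an illustrative stand-in written in Python, not the authors' generalized algorithm or their Fortran code; the function names, default parameters, and example fragments are assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two (start, end) intervals."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def fuzzy_c_means(intervals, n_clusters, m=2.0, n_iter=200, tol=1e-9, rng=None):
    """One run of standard Fuzzy C-means on interval endpoints.

    Returns (prototypes, memberships, objective). `m` is the fuzzifier.
    """
    rng = rng or random.Random()
    # Initialize prototypes from randomly chosen data intervals.
    prototypes = [tuple(float(v) for v in p)
                  for p in rng.sample(intervals, n_clusters)]
    objective = float("inf")
    memberships = []
    for _ in range(n_iter):
        # Membership update: inversely related to distance from each prototype.
        memberships = []
        for x in intervals:
            inv = [max(sq_dist(x, p), 1e-12) ** (-1.0 / (m - 1.0))
                   for p in prototypes]
            total = sum(inv)
            memberships.append([v / total for v in inv])
        # Prototype update: membership-weighted mean of starts and ends.
        prototypes = []
        for k in range(n_clusters):
            w = [u[k] ** m for u in memberships]
            w_sum = sum(w)
            prototypes.append((
                sum(wi * x[0] for wi, x in zip(w, intervals)) / w_sum,
                sum(wi * x[1] for wi, x in zip(w, intervals)) / w_sum,
            ))
        new_objective = sum(
            u[k] ** m * sq_dist(x, prototypes[k])
            for x, u in zip(intervals, memberships)
            for k in range(n_clusters)
        )
        converged = abs(objective - new_objective) < tol
        objective = new_objective
        if converged:
            break
    return prototypes, memberships, objective


def best_partition(intervals, n_clusters, n_restarts=200, seed=0):
    """Repeat the clustering from random starts and keep the lowest objective."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_restarts):
        result = fuzzy_c_means(intervals, n_clusters, rng=rng)
        if best is None or result[2] < best[2]:
            best = result
    return best


if __name__ == "__main__":
    # Hypothetical soluble fragments as (start, end) residue positions.
    fragments = [(1, 84), (3, 86), (5, 80), (320, 430), (325, 435), (318, 428)]
    prototypes, memberships, objective = best_partition(fragments, 2, n_restarts=20)
    for start, end in sorted(prototypes):
        print(f"prototype: residues {start:.1f} to {end:.1f}")
```

Because each run can only reach a locally optimal partition, the restart loop keeps the partition with the lowest objective value. The number of clusters must be chosen in advance in this sketch.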

Colonies were screened for expression of soluble protein fragments by transferring the nitrocellulose membranes onto prewarmed LB agar plates, as above, supplemented with 0.2% w/v arabinose to induce expression of the GFP11-tagged protein fragments. Plates were incubated for 2 h at 37 °C. A subsequent rest phase allowing turnover of insoluble or unfolded protein fragments was implemented by transferring membranes back to the original noninducing LB agar plates and incubating for 1 h at 37 °C. Finally, membranes were transferred to LB agar plates, as above, supplemented with 0.5 mM isopropyl β-D-1-thiogalactopyranoside (IPTG) to induce expression of GFP1–10. Plates were incubated for 2–4 h at 37 °C before viewing under an Illumatool Bright Light System (LT-9900, Lightools Research, USA) equipped with a 480 nm/40 nm band-pass excitation filter (light blue) and a 515 nm cutoff emission filter (deep yellow). Colonies with medium to high levels of fluorescence were immediately picked and used to inoculate 5 ml liquid cultures (LB media supplemented with 100 μg/ml ampicillin), allowing preparation of glycerol stocks and isolation of plasmids for DNA sequencing.
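
For planning purposes, the three-phase induction sequence described above can be written down as a small data structure. This is a minimal sketch: the media, durations, and temperature are taken from the text, while the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    medium: str
    min_hours: float
    max_hours: float
    temperature_c: float = 37.0  # all three phases are run at 37 °C


# Phases of the screening day, in the order given in the text.
SCHEDULE = [
    Phase("induce GFP11-tagged fragments", "LB agar + 0.2% w/v arabinose", 2, 2),
    Phase("rest phase", "noninducing LB agar", 1, 1),
    Phase("induce GFP1-10", "LB agar + 0.5 mM IPTG", 2, 4),
]


def total_hours(schedule):
    """Return the (shortest, longest) total incubation time in hours."""
    shortest = sum(phase.min_hours for phase in schedule)
    longest = sum(phase.max_hours for phase in schedule)
    return shortest, longest


if __name__ == "__main__":
    for phase in SCHEDULE:
        print(f"{phase.name}: {phase.min_hours}-{phase.max_hours} h on {phase.medium}")
    print("total incubation (h):", total_hours(SCHEDULE))
```

The totals cover incubation only; membrane transfers, imaging, and colony picking add further hands-on time.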

Gene fragmentation by nebulization

Initially the synthetic p85α gene, in the plasmid pMK-p85α, was amplified by PCR using the primers 5′-ATGGCCGCTGAAGGTTATCAG-3′ and 5′-TTAACGACGCTGCTGTGCATAC-3′. For later nebulization experiments, designed to reduce end bias in the fragment library, the p85α gene was amplified with 400 bp of the flanking vector DNA appended to the 5′ and 3′ ends using the primers 5′-CAAATAGGGGTTCCGCGCAC-3′ and 5′-GCTTGGAGCGAACGACCTACAC-3′. Typically 10–20 μg of the purified PCR product was suspended in 2 ml nebulization buffer (10 mM Tris-HCl, pH 8, and 50% v/v glycerol) and placed in a small nebulizer (Life Technologies) on ice. Mechanical shearing of the DNA was achieved by passing compressed nitrogen through the device at 20–35 psi for 20 min in total, to produce the desired fragment size distribution. At 5 min intervals the device was removed from the gas and centrifuged briefly at 100g to bring the DNA-containing liquid to the bottom of the device. Following nebulization the gene fragments were isolated from the nebulization buffer using a PCR clean-up kit (Macherey-Nagel, Germany). End repair and 5′ phosphorylation of the gene fragments were carried out using a Quick Blunting Kit (New England Biolabs), containing T4 DNA polymerase and T4 polynucleotide kinase. Size selection of the fragment library was then performed as detailed below.

Gene fragmentation by uracil doping and enzymatic digest

The protocol for uracil doping of the p85α gene and subsequent fragmentation by a uracil-specific enzymatic digest was based on that originally described by Reich et al. [18]. Primers used for the dUTP-doped PCR of the p85α gene were as detailed for the initial gene fragmentation by nebulization. PCR was carried out using PfuTurbo Cx Hotstart DNA polymerase (Agilent Technologies, USA), which is engineered to amplify uracil-containing templates with high fidelity. Reactions contained the following: 1X PfuTurbo Cx buffer; 1 mM each of dATP, dGTP, and dCTP; 1 mM dUTP/dTTP mix; 1 μM each of the forward and reverse primers; 50 μg pMK-p85α plasmid template; and 2.5 units of PfuTurbo Cx Hotstart DNA polymerase. For p85α, replacing 1.5% of the dTTP with dUTP produced fragments with the desired size distribution. The polymerase was activated by incubating for 2 min at 95 °C, and amplification was carried out by 30 cycles of 30 s at 95 °C (denaturation), 30 s at 65 °C (annealing), and 2 min at 72 °C (extension).

An extended digest of the dUTP-doped PCR product using three enzymes was carried out to produce 5′-phosphorylated and blunt-ended fragments. Digests contained approximately 10 μg of purified uracil-doped PCR product, 1X NEB buffer 3 (50 mM Tris-HCl, 100 mM NaCl, 10 mM MgCl2, 1 mM DTT, pH 7.9), 15 units uracil DNA glycosylase (NEB), 15 units endonuclease IV (NEB), and 15 units S1 nuclease (Promega). Samples were incubated for 14–16 h at 37 °C. Size selection of the fragments was then carried out as detailed below.

Size selection of fragment libraries

Gene fragments prepared by nebulization or the uracil-dependent method were separated by electrophoresis using 1–1.5% w/v agarose gels run in 40 mM Tris-acetate buffer, pH 8, with 1 mM ethylenediaminetetraacetic acid (EDTA). DNA was extracted by adapting the protocol of Winberg et al. [36]. Once sufficient resolution of fragments was achieved, slits were made in the gel across the sample lanes at the highest and lowest desired fragment sizes, based on a 1 kb plus DNA ladder (Invitrogen). A small piece of dry diethylaminoethyl (DEAE) cellulose paper (DE-81, Whatman, USA) was inserted into each slit. Electrophoresis was then continued in order to bind fragments within the desired size range to the lower paper, with the upper paper acting to block migration of larger fragments. DNA was eluted from the lower paper by immersion in 200 μL high salt buffer (50 mM Tris-HCl, 1 M NaCl, 10 mM EDTA) in a 1.5 mL tube and incubating for 10 min at 65 °C. After centrifuging the sample for 5 min at 10,000g to pellet the paper, the supernatant was collected. The elution step was repeated twice more and the supernatants were combined. Gene fragments were extracted from the combined sample and concentrated using a commercial PCR clean-up kit (Macherey-Nagel, Germany).

Transfer of fragments into the split-GFP solubility screen

Size-selected and 5′-phosphorylated gene fragments were blunt-end-ligated into pBAD_GFP11_T7LysH17A vector that had been digested with high-fidelity PvuII (NEB) and 5′ dephosphorylated with calf intestinal alkaline phosphatase (NEB, USA). Ligation reactions typically contained 200 ng vector, 50–200 ng DNA fragments, and 1 unit T4 DNA ligase (Roche, Switzerland) in ligation buffer (70 mM Tris-HCl, pH 7.5, 5 mM MgCl2, 5 mM DTT, 1 mM ATP, 5% w/v polyethylene glycol 8000) and were incubated at 14 °C for 16 h. DNA was purified from the ligation reactions using a PCR clean-up kit (Macherey-Nagel, Germany), and transformed into BL21(DE3) Gold/pET_GFP1–10 electrocompetent cells for the split-GFP solubility screen.

Small-scale purifications of p85α fragment prototypes from cluster analysis

The p85α prototype fragments determined by cluster analysis were amplified by PCR with 5′ SpeI and 3′ XhoI restriction enzyme sites (see Supplementary Table 1 for primers) and ligated into a laboratory-modified variant of pET15b (into which a SpeI site had been introduced immediately following the polyhistidine-tag encoding sequence). This allowed fragments to be expressed with a noncleavable N-terminal HHHHHHTS-tag. Expression plasmids were transformed into BL21(DE3) (Stratagene). Transformed bacteria were grown in 25 mL LB media supplemented with 100 μg/mL ampicillin. Cultures were shaken in flasks at 37 °C until the optical density at 600 nm was 0.8. Protein expression was then induced by the addition of 0.5 mM IPTG and growth was continued for 4 h at 37 °C before cells were harvested by centrifugation. Cell pellets were resuspended in 1 mL of lysis buffer (20 mM Tris-HCl, pH 8, 300 mM NaCl, 2 mM 2-mercaptoethanol, and Complete EDTA-free protease inhibitor cocktail (Roche, Switzerland)). Cell lysis was carried out by sonication of samples on ice using a Soniprep 150 (MSE, UK). Cell debris was pelleted by high-speed centrifugation, and the clarified supernatants were transferred to spin columns containing 100 μL Talon resin (Clontech, USA) preequilibrated with the lysis buffer. After loading the supernatant, the resin was washed three times with 700 μL lysis buffer containing 5 mM imidazole. His6-tagged proteins were eluted with 200 μL lysis buffer containing 250 mM imidazole.

Analysis of p85α fragment data with fuzzy clustering algorithms

To analyze the soluble fragment data we used a generalization of the Fuzzy C-means algorithm [37,38]: a widely used iterative reallocation algorithm which partitions data into a fixed number of clusters according to a measure of similarity. We employed several different similarity measures for interval-valued data (e.g., city-block distance [39] and squared Euclidean distance [40]). The algorithm returns both cluster prototypes (representative intervals associated with each cluster) and the degree to which each datum is associated with a cluster (the membership). Depending on the similarity measure employed, the boundaries of the cluster prototypes are computed as a weighted mean or median of the fragment boundaries, with the weighting done according to the cluster membership. As the algorithm returns only a locally optimal partition of the data, it needs to be run hundreds of times with random initialization to identify the best possible partition. Full details of the clustering analysis will be published elsewhere. Code implementing the algorithm (in Fortran95/2003) is available on request from the authors and at our laboratory website (http://persephone.sbs.auckland.ac.nz/richard/lab/).

Results

Overall Domain Seeking workflow

There are three principal steps in our Domain Seeking methodology (Fig. 1). First, a random, blunt-ended, and size-selected fragment library is generated for the target gene. Second, this library is cloned into a split-GFP system for medium-throughput expression and solubility screening of the corresponding protein fragments in E. coli. Third, global analysis of the fragmentation data is used to deduce the likely domain boundaries within the target protein, and to design constructs for in vitro solubility testing.

To test and develop the methodology used for Domain Seeking we applied it to a structurally characterized and multidomain protein, human p85α. This protein has five distinct domains, and structures have been determined for each of these [24–33]. In addition, p85α is frequently used as a benchmark for domain identification methods [14,18,21,41], allowing us to compare our results with those obtained previously. As required for the in vivo split-GFP assay, a synthetic p85α gene that is codon-optimized for expression in E. coli was employed.

Modification of a split-GFP solubility screen

For solubility screening of protein fragments we adapted the highly sensitive split-GFP assay developed by Waldo and co-workers [22,23] (Fig. 1). In the original vector system for the split-GFP assay, protein fragments are expressed in E. coli with a polyhistidine tag appended to their N-terminus and the β-11 strand of GFP (GFP11) appended to their C-terminus. Expression is under control of a tetracycline-inducible promoter. For our simplified approach we required a vector compatible with direct blunt-end cloning of gene fragment libraries. We also found it preferable to avoid the use of anhydrotetracycline to induce protein expression, as this compound is bactericidal to E. coli at relatively low concentrations [42]. A DNA cassette was generated containing an N-terminal polyhistidine tag, cloning sites enabling sticky-end (SpeI/XhoI) and blunt-end (PvuII) ligation, a 10 amino acid nonstructured linker, and the C-terminal GFP11 tag (Fig. 2). This cassette was ligated into the vector pBAD-MCS, for arabinose-inducible expression of the fragment library, generating the vector pBAD_GFP11.
For our application, advantages of using a pBAD vector include moderate levels of protein overexpression, rapid kinetics of induction in the presence of >0.05% w/v arabinose, and a steep drop-off in induction levels as arabinose levels are decreased below 0.002% w/v [43]. The original pET_GFP1–10 vector designed for IPTG-inducible expression of the remainder of GFP (GFP1–10) [22,23] was used without modification.

To achieve clear discrimination between soluble and insoluble fragments using the split-GFP assay, the expression of GFP11-tagged fragments is initially induced, then switched off during a rest phase, before expression of GFP1–10 commences [22,23]. This sequential induction is important, as during the rest phase unfolded or insoluble protein fragments are either rapidly recycled within E. coli or packaged into inclusion bodies. In contrast, properly folded and soluble fragments will persist in the cytosol [44]. If GFP1–10 is induced before the GFP11-tagged fragments can be processed by the cell, the reconstituted GFP11/GFP1–10 complex may have a chaperone effect on otherwise insoluble or unfolded fragments, generating false positives in the assay.

The protocol for the in vivo split-GFP solubility screen is similar to that previously published (Fig. 1) [22,23]. In brief, pBAD_GFP11 vectors carrying fragments of the target gene are transformed into BL21(DE3) Gold/pET_GFP1–10 cells and the colonies are grown on membranes placed on noninducing media. The next day, expression of the GFP11-tagged protein fragments is first induced by shifting the colonies to an arabinose-containing medium for 2 h, a rest phase is then incorporated by moving the colonies back onto a noninducing medium for 1 h, and finally expression of GFP1–10 is induced on IPTG-containing medium. Following fluorescence imaging of the plates, colonies are picked manually for isolation of plasmids and DNA sequencing.

In order to test the suitability of the pBAD_GFP11 vector for the split-GFP solubility screen, both the highly soluble E. coli maltose binding protein and an insoluble truncated MBP variant (amino acids 1–183) were inserted into the vector. During initial tests of the split-GFP assay using this vector system, colonies expressing the soluble MBP and colonies expressing the insoluble MBP1–183 were both highly fluorescent (Fig. 3A). Through monitoring colony fluorescence at each step of the split-GFP assay we determined that leaky expression of GFP1–10 from its T7 promoter was occurring prior to induction with IPTG. This was overcome by transformation of the plasmid pLysS into the host strain for basal expression of the T7 polymerase inhibitor, T7 lysozyme [34].
Colonies expressing MBP were then clearly differentiated from those expressing MBP1–183 when a rest phase was implemented between the two induction steps (Fig. 3B). Experiments varying the length of the rest phase showed that a 30 min rest was sufficient to substantially reduce fluorescence from cells expressing insoluble protein (MBP1–183), while further fluorescence reduction was not evident beyond 1 h (data not shown). A standard rest phase of 1 h was therefore adopted in subsequent experiments.

To maintain the simplicity of the Domain Seeking system we then incorporated the gene for T7 lysozyme directly into the pBAD_GFP11 vector. Based on the design of pLysS by Studier [34], we transferred the T7 lysozyme gene into pBAD_GFP11 without a specific upstream promoter to generate the vector pBAD_GFP11_T7Lys. Control experiments in which MBP and MBP1–183 were expressed from this vector indicate that the level of T7 lysozyme expression from pBAD_GFP11_T7Lys is sufficient to block the basal T7 RNA polymerase activity (Fig. 3A).

As a final modification to the expression vector we introduced the mutation H17A to the T7 lysozyme gene. This mutation reduces the ability of T7 lysozyme to digest the peptidoglycan cell wall of E. coli, while retaining its ability to bind and inhibit T7 RNA polymerase [45,46]. Reducing the T7 lysozyme amidase activity was essential for maintaining cell stability in our library screening application. The resulting pBAD_GFP11_T7LysH17A vector (Supplementary Fig. 1) was used for the screening of p85α fragment libraries as described below.

Fig. 3. Performance of the modified split-GFP solubility assay. (A) Comparison of colony fluorescence for soluble and insoluble proteins, in the absence or presence of T7 lysozyme. BL21(DE3) Gold/pET_GFP1–10 cells were transformed with pBAD_GFP11 (left panels) or pBAD_GFP11_T7Lys (right panels) carrying the MBP (top panels) or MBP1–183 (bottom panels) genes. A split-GFP assay was conducted on the colonies with a 1 h rest phase. (B) Effect of the rest phase on the split-GFP assay incorporating T7 lysozyme. BL21(DE3) Gold/pET_GFP1–10/pLysS cells were transformed with the pBAD_GFP11 vector carrying MBP, MBP1–183, or no protein fragment (negative control). A split-GFP assay was then conducted on the colonies with either no rest phase (top panel) or a 1 h rest phase (lower panel) prior to induction of GFP1–10 expression.

Generation of gene fragment libraries

In selecting methods for gene fragmentation our priorities were to minimize sequence or positional bias, reliably control the fragment size distribution, and limit cost. On this basis we chose to trial a mechanical method of fragmentation, and a uracil-doping/enzymatic digest approach. To streamline our protocols the blunt-ended fragments we generated were directly cloned into the expression vector for the split-GFP solubility assays. As with any method that uses nondirectional cloning of random DNA fragments, there is overall a 1/18 chance of fragments being ligated in the correct open reading frame for expression. We chose not to incorporate a selection step for the correct open reading frame prior to solubility screening. A number of methods have been established to achieve open reading frame selection [14,20,21], decreasing the number of clones that need to be screened. However, these methods are also relatively laborious and the additional selection steps may create bias in the fragment library prior to the solubility screen. An important step in creating gene fragment libraries is effective size selection.
This is necessary to eliminate small fragments from the library that may be preferentially ligated in subsequent cloning steps, and to further tailor the fragment size range to a particular protein target. For the majority of our experiments on p85α we chose to create and screen two different size-selected fragment libraries: 200 to 400 bp and 400 to 1000 bp. The first library was designed to identify fragments corresponding to the three src-homology (SH) domains of p85α: SH3 (84 residues), N-SH2 (113 residues), and C-SH2 (103 residues). The second library was designed to identify fragments corresponding to the larger breakpoint cluster region-homology (BCR, 185 residues) and coiled coil (162 residues) domains of p85α. Separating the library into two or more broad size ranges helps reduce the potential bias resulting from the more efficient ligation of smaller DNA fragments. Size selection was achieved using gel electrophoresis.

Gene fragmentation by nebulization

We first tested the mechanical fragmentation method of nebulization [47]. The advantages of this method are that it exhibits no sequence bias; the fragment size distribution has a low variance; the mean fragment size is readily controlled (see Supplementary Fig. 2); and nebulization devices are inexpensive and reusable. A disadvantage is that breakage can occur at various points in the phosphodiester backbone, and hence fragments must be enzymatically end-repaired for downstream applications.

The gene for p85α was amplified by PCR and fragment libraries were generated under appropriate nebulization conditions. Fragments were then end-repaired, 5′-phosphorylated, size selected (200–650 bp), and cloned into the pBAD_GFP11_T7LysH17A vector for the split-GFP solubility assay. Initially a small library of these fragments (approximately 1.1 × 10³ colonies) was screened using the solubility assay and 20 fragments were selected. A strong end-bias effect was evident among these, with 17 fragments retaining the original 5′ end of the gene (Supplementary Fig. 3). This end-bias effect has been noted previously when using mechanical shearing to facilitate domain mapping [16]. There are three probable causes. First, mechanical shearing is less efficient near the end of a DNA fragment. Second, fragments retaining the original 5′ or 3′ ends of the gene are less affected by the fundamental limitations of the end repair process. Third, fragments retaining the original 5′ or 3′ end of the gene have a higher probability of being transferred into the screening system in the correct open reading frame.

Options for overcoming the end-bias effect include adding noncoding DNA to both ends of the gene during the PCR amplification, or circularizing the PCR product prior to nebulization. We chose to flank each end of the p85α gene with 400 bp of noncoding DNA. This ensured that fragmentation events were equiprobable over the full length of the p85α gene. Two fragment libraries (200–400 and 400–1000 bp) were generated and approximately 9 × 10³ colonies were screened with the split-GFP assay. All 14 of the p85α fragments selected based on the screen were internal and were spread over 4 of the 5 known structural domains of p85α (Supplementary Fig. 4). However, the low number of hits obtained in this screen is likely to reflect the remaining preferential ligation of fragments carrying the unmodified ends generated in PCR amplification.

Gene fragmentation by dUTP-doped PCR and uracil-base excision

The second gene fragmentation method investigated was initially described as part of the Combinatorial Domain Hunting approach to soluble protein fragment mapping [18]. In this method the target gene is amplified using PCR in the presence of limited amounts of dUTP, leading to the random substitution of uracil at thymidine sites in the product (Fig. 1).
This uracil-containing DNA is subsequently digested with an enzyme cocktail of uracil-DNA glycosylase (excision of uracil bases), endonuclease IV (generation of single-stranded nicks at abasic sites), and S1 nuclease (conversion of single-stranded nicks to double-stranded breaks). The final product is a library of 5′-phosphorylated blunt-ended fragments.

A significant advantage of the dUTP-doped PCR/enzymatic digest approach is that the fragment size distribution is easily tuned by altering the dUTP:dTTP ratio used in the PCR of the target gene [10,18]. In addition, the enzymatic digest does not need to be tightly time controlled, as it is relatively specific for uracil sites. A final benefit is that the fragments produced do not require end repair and can be blunt-end ligated directly into the expression vector for solubility screening.

A disadvantage of this approach is that fragmentation can only occur at A:T sites. Hence the nature of the fragment library is influenced by the A/T distribution of the target gene, and not all gene fragments can be generated. However, codon optimization of the DNA sequence can be used to minimize compositional bias and maximize the number of possible fragment start or end sites. In the Combinatorial Domain Hunting approach bias is further reduced by ligating the fragments into a mixture of vectors with nine different reading frames [18]. This means that every fragment can potentially be captured in frame, increasing the diversity of the screened fragment library. In our simple system, in which blunt-end fragments are cloned into a single expression vector, only fragments that are themselves in-frame can be expressed in the solubility assay. To understand how A/T distribution could influence the p85α fragment library, we analyzed the potential in-frame fragmentation sites across our synthetic p85α gene (Supplementary Fig. 5). This shows a relatively even distribution, with potential start or end sites present within a five-residue window of the majority of residues. Hence, we did not further optimize the A/T composition of our synthetic p85α gene.

Uracil-doped PCR was used to fragment the p85α gene (Fig. 4). Substitution of 1.5% of the dTTP with dUTP generated fragments ranging from 100 to 1200 bp in size. However, the size distribution was skewed, with the majority of fragments less than 600 bp in size. Fragments were size-selected to create two libraries (200–400 and 400–1000 bp), which were screened with the split-GFP solubility assay. In total 1 × 10⁴ colonies were screened and from these 81 p85α fragments were selected. These fragments are evenly distributed across the target protein (Supplementary Fig. 6). This was our preferred method for generating fragment libraries due to the relatively low sequence and positional bias that can be achieved, the ability to directly ligate the fragments into the expression vector without end repair, and the reproducibility of the technique.

Fig. 4. Random fragmentation of the p85α gene using the uracil-doping approach [18]. (1) Product from the PCR amplification of the p85α gene with 1.5% dUTP and 98.5% dTTP. (2) PCR product following an enzymatic digest with a mixture of uracil-DNA glycosylase, endonuclease IV, and S1 nuclease. Dotted lines indicate regions of gel selected for extraction of the 200–400 and 400–1000 bp fragment libraries. Samples were run on a 1.5% w/v agarose gel using 40 mM Tris-acetate buffer, pH 8, with 1 mM EDTA.

Global data analysis to predict structural domains

As detailed above, fragment libraries for p85α were generated by nebulization or uracil doping, ligated into the pBAD_GFP11_T7LysH17A expression vector, and screened using the modified split-GFP assay. The resulting data were then combined and globally analyzed. In total 115 p85α fragments from colonies with medium to high levels of fluorescence were identified and analyzed, yielding 107 unique fragments after removal of duplicates (Fig. 5). These fragments are relatively evenly spread across the five known structural domains of the protein. As a proof of concept, we sought to use this modest number of soluble fragments to demarcate the structural domains within p85α. Our goal was to achieve this without reference to the known structural information on p85α, to demonstrate the potential for de novo application to an uncharacterized protein target.

Global analysis helps to overcome the principal limitation of using soluble protein fragments to identify domain boundaries, which is that the data are intrinsically imprecise. In a large protein, structural domains are often connected to each other by flexible linkers, or flanked by long intrinsically disordered regions of the polypeptide. The effect of retaining such flanking sequences on domain solubility is unpredictable, but potentially very significant. Addition of highly polar and unstructured polypeptides to structured domains is a recognized strategy for enhancing domain solubility. Working in the opposite direction, the secondary structures at domain boundaries can often undergo limited truncation without a significant effect on solubility. These factors combine to create "ragged" ends on individual fragments encompassing the same structural domain. Despite these limitations, the overall fragment expression and solubility data obtained do reflect the underlying structural domains.

Frequency analyses of start and end sites in hit p85α fragments

A simple analysis of the start and end site frequencies of the fragment data, averaged over a 5 residue window, clearly defines some locations as fragmentation "hotspots" (Fig. 5). The proline-rich loop region from residues 84 to 130, separating the SH3 and BCR domains [24–26,30], is indicated by peaks in the end site frequency. Similarly, the proline-rich region from residues 303 to 314, separating the BCR and N-SH2 domains [26,27,31], corresponds to a peak in the start site frequency. The boundary between the N-SH2 and the coiled coil domain is the best defined by the data. The prominent peak in both start and end site frequency, centered around Tyr431, correlates precisely with the short linker connecting these two domains [27,28].

Fig. 5. Global analysis of the p85α fragment data generated using the split-GFP solubility screen of p85α gene fragment libraries. The top panel shows the frequency of start and end sites for the p85α fragments within a 5 residue window. The central panel shows the known structural domains of p85α. For the coiled coil domain the two extended α-helices are highlighted in dark purple with the remainder of the domain indicated by light purple. The bottom panel shows the fragments mapped to the p85α protein and grouped according to a fuzzy clustering algorithm with 7 clusters specified. Fragments are assigned to the cluster in which they have the highest membership. Orange boxes indicate the boundaries of the 7 cluster prototypes. Left to right the cluster prototypes are 1–119, 85–223, 139–243, 302–432, 432–519, 427–598, and 577–718.
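The start and end site frequency analysis described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `site_frequencies`, the fragment list, and the protein length of 724 residues (2172 bp / 3) are assumptions made for the example, and the sketch counts sites in a centred 5-residue window rather than reproducing whatever normalization was used for Fig. 5.

```python
from collections import Counter


def site_frequencies(fragments, length, window=5):
    """Return (start_freq, end_freq), one value per residue (index 0 = residue 1).

    Each value is the number of fragment start (or end) sites that fall in a
    window of `window` residues centred on that residue; `window` should be odd.
    """
    half = window // 2
    starts = Counter(start for start, _ in fragments)
    ends = Counter(end for _, end in fragments)

    def windowed(counts):
        # A Counter returns 0 for positions with no sites (or off the protein).
        return [sum(counts[pos] for pos in range(res - half, res + half + 1))
                for res in range(1, length + 1)]

    return windowed(starts), windowed(ends)


# Hypothetical fragments as (start, end) residue numbers, 1-based inclusive.
# These are illustrative values, not the published p85α data.
fragments = [(1, 119), (3, 118), (301, 432), (303, 431), (430, 598), (432, 519)]
start_freq, end_freq = site_frequencies(fragments, length=724)

# Residues whose window contains at least two start sites (candidate hotspots).
start_hotspots = [res for res, n in enumerate(start_freq, start=1) if n >= 2]
```

Dividing each count by the window width would give a per-residue average instead of a windowed count; the positions of the peaks are the same either way.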
The boundary between the coiled coil domain and the C-terminal SH2 domain, from residues 600 to 617 [28,29], is the least well defined by the frequency analysis. Interestingly, structural features of the coiled coil domain itself are reflected in the data. This domain comprises an antiparallel intramolecular coiled coil (residues 438–587), followed by a short α-helix (residues 590–599) [28]. There is a spike in both start and end site frequency around residue 515, corresponding to the loop that connects the two α-helices of the coiled coil. There is also a peak in the end site frequency around residue 583, close to the C-terminus of the coiled coil. Although the start of the C-SH2 domain is not clearly defined by the frequency data, the end is, with 8 fragments terminating within the last 6 amino acids of p85α. Only 2 of these terminate on the last amino acid of p85α, indicating that this is not an end-bias effect.

The only major feature of the fragmentation data that does not correlate well with known structural information is the large spike in end site frequency within the BCR domain. Of the fragments identified, only three span the majority of the BCR domain. The fragmentation hotspots around residues 224 and 242 are not located between secondary structure elements, but occur instead within two α-helices that are integral to the BCR domain structure [26]. The low number of fragments that encompass the entire BCR domain (185 residues) may result from the skewed size distribution in the DNA fragment libraries employed, with fragments below 600 bp predominating. It is interesting to note that in previous screens of p85α fragments carried out using split-GFP [14,41] or dot-blot-based solubility assays [18,21], internal fragmentation of BCR has been observed in the same region.

Cluster analysis of hit p85α fragments

Although analysis of the fragmentation frequency at a given site is informative, this type of analysis does not make full use of the data, which are interval valued: each individual fragment has both a start and an end point. If this information is retained, procedures that systematically group the fragments according to some measure of similarity can more directly reveal the organization of the structural domains within a protein. We employed a fuzzy clustering algorithm, suitably adapted for interval data, to address this problem [37–40]. In contrast to hard clustering, where each fragment would be assigned to one and only one cluster, fuzzy clustering allows each fragment to belong to multiple clusters, with an associated degree of membership. Fuzzy clustering is the most appropriate choice for this analysis, because a soluble fragment could encompass several structural domains. The algorithm partitions the fragments into a defined number of clusters and simultaneously determines a "prototype" for each cluster. The cluster prototypes are representative of the structural domains that underpin each cluster.

A key problem in cluster analysis is to determine the optimal number of clusters. A simple method for achieving this is to begin the analysis with only two clusters and then repeat with an increasing number of clusters. The clustering algorithm optimizes the fit between the data and the cluster prototypes by minimizing an objective function. As the number of clusters is increased, the objective function will drop rapidly until the optimal number of clusters is approached, after which it will decrease more gradually. The identification of this semiabrupt transition or "elbow" in plots of objective function versus cluster number provides an effective, if heuristic, means of determining the appropriate number of clusters. For the p85α fragment data, the appropriate number of clusters is predicted to be between 5 and 7 (Supplementary Fig. 7). This correlates well with the presence of five major structural domains within the protein. In Fig.
5 the results of partitioning the soluble fragments into 7 clusters are illustrated, together with the resulting cluster prototypes (corresponding results for 5 and 6 clusters are presented in Supplementary Figs. 8 and 9). Combining and comparing the prototypes for the 5, 6, and 7 cluster analyses shows that three are essentially invariant, indicating that they are most clearly defined by the data. These are the prototypes that completely encompass the SH3 (prototype 1–119), N-SH2 (prototype 301–432), and C-SH2 (prototype 577–718) domains. The three different prototypes associated with the coiled coil domain correlate with the secondary structure of this domain. These approximately match one helix of the coiled coil (prototype 432–519), both helices of the coiled coil (prototype 427–581), or the entire domain (prototype 427–598). None of the prototypes include the intact BCR domain. Instead, four different prototypes (79–223, 85–223, 110–226, 139–243) are observed from the N-terminal half of the BCR domain, all starting around the end of the SH3 domain or beginning of the BCR domain and ending midway through the BCR domain. Although fragments that span the majority of the BCR domain are present among the hits (Fig. 5), these do not generate a separate prototype in the cluster analysis because they are outweighed by the large number that only include the N-terminal half.

In vitro solubility tests of p85α prototype fragments

Cluster analysis correctly demarcated most of the structural domains within p85α. To further validate the analysis we tested whether the prototypes determined for each cluster are themselves soluble fragments of p85α. Again, combining the results of the five, six, and seven cluster analyses, and treating prototypes with endpoints within 2 amino acids as redundant, produces 10 unique cluster prototypes. Each of these was cloned into a pET-based vector for overexpression with an N-terminal His6-tag in E. coli.
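The fuzzy clustering of fragments and the elbow heuristic described in the cluster analysis section can be illustrated with a small sketch. It is not the authors' implementation: the paper cites interval-data generalizations of Fuzzy C-means [37–40], whereas this example assumes the simplest case, in which each fragment is treated as a (start, end) point, distances are squared Euclidean distances between interval bounds, the fuzzifier is m = 2, and prototypes are seeded deterministically from the sorted data. The function name `fuzzy_c_means` and the fragment list are hypothetical.

```python
def fuzzy_c_means(intervals, n_clusters, m=2.0, n_iter=200, tol=1e-6):
    """Fuzzy C-means on intervals treated as (start, end) points.

    Returns (prototypes, memberships, objective). The distance measure
    (squared Euclidean on interval bounds) is an assumption of this sketch.
    """
    n = len(intervals)
    ordered = sorted(intervals)
    # Deterministic seeding: evenly spaced members of the sorted data.
    protos = [tuple(float(x) for x in ordered[(2 * k + 1) * n // (2 * n_clusters)])
              for k in range(n_clusters)]
    u = []
    for _ in range(n_iter):
        # Squared distance from every interval to every prototype.
        d2 = [[(s - ps) ** 2 + (e - pe) ** 2 for ps, pe in protos]
              for s, e in intervals]
        # Membership update: u_ik proportional to (d_ik^2)^(-1/(m-1)).
        u = []
        for row in d2:
            if min(row) == 0.0:  # interval coincides with a prototype
                hits = [1.0 if d == 0.0 else 0.0 for d in row]
                u.append([h / sum(hits) for h in hits])
            else:
                w = [d ** (-1.0 / (m - 1.0)) for d in row]
                u.append([x / sum(w) for x in w])
        # Prototype update: membership-weighted mean of starts and of ends.
        new = []
        for k in range(n_clusters):
            wk = [row[k] ** m for row in u]
            total = sum(wk)
            new.append((sum(w * s for w, (s, _) in zip(wk, intervals)) / total,
                        sum(w * e for w, (_, e) in zip(wk, intervals)) / total))
        shift = max(abs(a - b) for p, q in zip(protos, new) for a, b in zip(p, q))
        protos = new
        if shift < tol:
            break
    # Objective: membership-weighted sum of squared distances to prototypes.
    objective = sum(u[i][k] ** m * ((s - protos[k][0]) ** 2 + (e - protos[k][1]) ** 2)
                    for i, (s, e) in enumerate(intervals)
                    for k in range(n_clusters))
    return protos, u, objective


# Hypothetical fragments (start, end), not the published p85α data.
intervals = [(1, 119), (3, 117), (2, 121),
             (301, 432), (303, 430), (299, 434),
             (577, 718), (575, 716), (579, 720)]

prototypes, memberships, _ = fuzzy_c_means(intervals, n_clusters=3)
# Assign each fragment to the cluster in which it has the highest membership.
labels = [max(range(3), key=row.__getitem__) for row in memberships]
# Objective versus cluster number, for locating the "elbow".
objectives = {c: fuzzy_c_means(intervals, c)[2] for c in range(2, 6)}
```

With data like these, the objective falls sharply between two and three clusters and changes little afterwards, which is the kind of transition the elbow heuristic looks for. Other distance measures for interval data, such as the city-block distance, would change the prototype update and are not covered by this sketch.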
Fig. 6. Small-scale purifications of soluble p85α prototype fragments from cluster analysis. Prototypes are 1–119 (lane 1), 79–223 (lane 2), 85–223 (lane 3), 110–226 (lane 4), 139–243 (lane 5), 301–432 (lane 6), 427–581 (lane 7), 427–598 (lane 8), 432–519 (lane 9), and 577–718 (lane 10). A white arrowhead indicates the relevant soluble protein band for each prototype. Prototypes were cloned into a derivative of pET15b for overexpression with a noncleavable N-terminal His6-tag in BL21(DE3) cells. The soluble component of the cell lysate was applied to cobalt-derivatized affinity resin and the His6-tagged p85α fragments were eluted with imidazole. There was detectable expression of prototypes 139–243 (lane 5) and 432–519 (lane 9); however, they were insoluble. There was no detectable expression of prototype 577–718 (lane 10); hence the solubility of this prototype could not be determined.

The soluble components of the cell lysates were applied to a cobalt-derivatized affinity resin and bound proteins were subsequently eluted with imidazole. Out of the 10 prototypes, six are soluble, one is sparingly soluble, and two are insoluble in vitro (Fig. 6), without any optimization of the expression conditions. The prototype for the C-SH2 domain (577–718) did not express and therefore its solubility could not be determined. The soluble prototypes include those that encompass the intact SH3 (1–119), N-SH2 (301–432), and coiled coil (427–598) domains. Interestingly, 3 of the 4 prototypes for the N-terminal region of the BCR domain are also soluble (79–223, 85–223, 110–226). The two prototypes that truncate the coiled coil domain, 427–581 and 432–519, are sparingly soluble and insoluble in vitro, respectively.

Given the modest number of soluble fragments analyzed, the relatively large size of p85α, and its complexity, it is encouraging that the cluster analysis defined soluble domain prototypes that match 3 of the 5 domains of p85α (SH3, N-SH2, and coiled coil domains). Although the solubility of the 577–718 prototype could not be assessed, it is interesting to note that this encompasses the entirety of the structured region of the C-SH2 domain [29,32,33]. Based on the overall results for p85α, we suggest that when Domain Seeking experiments are performed on structurally uncharacterized proteins, the cluster prototypes will form an effective starting point for subsequent construct design.

Discussion

We have demonstrated that the structural domains within a target protein can be identified by globally analyzing the data from a straightforward fragment solubility screen. To identify soluble protein fragments we adapted the in vivo split-GFP assay developed by Waldo and co-workers [22,23]. In our application of this assay it was essential to repress basal expression of the GFP barrel (GFP1–10) by incorporating a T7 RNA polymerase inhibitor.
Constitutive expression of the inhibitor (a T7 lysozyme Zn2+ coordination mutant) reduced false positives from insoluble fragments, without compromising the bacterial host.

We used two contrasting techniques to create gene fragment libraries for screening with the split-GFP assay. Mechanical fragmentation using nebulization led to end-bias effects, unless the gene was flanked with long stretches of noncoding DNA. Uracil-doped PCR coupled with enzymatic base excision [18] was our preferred method, as it produces libraries of fragments with low sequence or positional bias that do not require enzymatic end repair.

Identifying structural domain boundaries from soluble fragment data requires careful data analysis. Domain boundaries are defined imprecisely by solubility assays. This is because domains are generally tolerant of unstructured hydrophilic sequences appended at their termini and can often undergo limited truncation without unfolding. Global data analysis helps to overcome this inherent limitation of a solubility-based approach. We employed two complementary techniques: fragmentation site frequency analysis and cluster analysis using a generalization of the Fuzzy C-means algorithm (Fig. 5). The fragmentation sites mostly populate the interdomain regions of p85α, while cluster analysis defined prototypes that encompass four of its five known domains. To our knowledge this is the first time that cluster analysis has been applied to protein fragmentation data.

Our Domain Seeking procedure rests on the observation that significant truncation of structural domains is usually globally destabilizing, leading to exposure of sequestered hydrophobic side chains and aggregation. In large part, the data for p85α support this observation. However, they also suggest that the procedure may sometimes generate soluble structures at the subdomain level. This is exemplified by the identification of soluble fragments that correspond to just the N-terminal half of the p85α BCR domain (Fig. 5, see also [14,18,21,41]). While it would be interesting to characterize the stability and structure of these subdomain fragments, the result itself is biologically misleading, as the N-terminal half of the BCR domain is unlikely to retain function [26].

An additional issue arises from our use of a cell-based assay to report on protein solubility. This results in a very simple experimental workflow.
However, the complex intracellular environment may allow the persistence of fragments that would be insoluble in vitro, effectively generating false positives in the assay. An example is provided by the p85α fragments encompassing just one of the extended α-helices of the coiled coil domain. These generated fluorescent colonies in the split-GFP assay, yet the corresponding fragment prototype proved insoluble in vitro. Ideally the fragment screening process would incorporate additional in vitro tests for solubility and stability. The latter might be carried out by high-throughput differential scanning calorimetry, or dye-based thermal shift assays. While this would allow more accurate discrimination of domain boundaries, these additional steps would be more resource intensive. We envisage that for many protein targets, results from the streamlined Domain Seeking approach, combined with bioinformatic analysis, will be sufficient to guide initial construct design. Once soluble constructs are obtained, domain boundaries can then be refined using standard biochemical and biophysical approaches.

It is encouraging that we were able to demarcate most of the domains within p85α by screening fragment libraries of very modest size. For a gene the length of p85α (2172 bp) the number of theoretically possible in-frame fragments in the range 200–1000 bp is 1.4 × 10⁵. For the fragment library generated by uracil doping, the screening of 1 × 10⁴ colonies corresponds to around 550 in-frame fragments. From this, 81 fragments produced fluorescent colonies in the split-GFP assay: a hit rate of approximately 15%. These statistics are probably highly target dependent. A protein such as p85α, a kinase regulatory subunit with an extended structure and few interdomain interactions, appears likely to generate many more soluble fragments than a compact protein of comparable size.
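The library statistics quoted above can be checked with a short calculation. The sketch below rests on two assumptions that are not spelled out in the text: an "in-frame fragment" is taken to be one that starts on a codon boundary and spans a whole number of codons, and the 1/18 in-frame probability is decomposed as one of two insert orientations times one of three reading frames at each end of the insert. The function name `count_in_frame_fragments` is hypothetical.

```python
from fractions import Fraction


def count_in_frame_fragments(gene_bp, min_bp, max_bp):
    """Count fragments that start on a codon boundary and span whole codons."""
    n_codons = gene_bp // 3
    total = 0
    for length_bp in range(min_bp, max_bp + 1):
        if length_bp % 3 == 0:                      # whole number of codons
            total += n_codons - length_bp // 3 + 1  # possible start codons
    return total


# Values taken from the text: 2172 bp gene, 200-1000 bp fragments,
# 1 x 10^4 colonies screened, 81 fluorescent hits.
possible = count_in_frame_fragments(2172, 200, 1000)

# Assumed decomposition of the 1/18 chance of in-frame ligation:
# orientation (1/2) x frame at the 5' junction (1/3) x frame at the 3' junction (1/3).
p_in_frame = Fraction(1, 2) * Fraction(1, 3) * Fraction(1, 3)

expected_in_frame = float(10_000 * p_in_frame)
hit_rate = 81 / expected_in_frame
```

Under these assumptions the count is 140,175 (about 1.4 × 10⁵), the expected number of in-frame clones is about 556, and the hit rate is about 0.146, in line with the rounded figures given in the text.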
In summary, by applying our fragmentation, screening, and analysis protocols to p85α, we mapped four of the five known structural domains of this protein without reference to existing structural information. Although the size selection of library fragments does imply some prior knowledge of domain size, this presents no real practical issue. The majority of protein domains are less than 200 amino acids in size, and domains larger than 500 amino acids are exceedingly rare [48]. Hence, this method could be used de novo for the analysis of many large or complex proteins that are otherwise experimentally intractable.

Acknowledgments

The pET_GFP1–10 vector for the split-GFP assay was a kind gift from Geoff Waldo at the Los Alamos National Laboratory, USA. The authors thank Jason Busby, Stephanie Dawes, James Dickson, Damien Fleetwood, Shaun Lott, and Paul Young in the School of Biological Sciences at the University of Auckland for useful discussions related to this work. We are also grateful to Nicole Herr for assistance with editing this manuscript. Esther Bulloch was supported by a New Zealand Ministry of Science Postdoctoral Fellowship.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.ab.2014.06.021.

References

[1] A. Lorence (Ed.), Recombinant Gene Expression, Springer, New York, 2012.
[2] S.J. Hubbard, The structural aspects of limited proteolysis of native proteins, Biochim. Biophys. Acta 1382 (1998) 191–206.
[3] A. Raymond, T. Haffner, N. Ng, D. Lorimer, B. Staker, L. Stewart, Gene design, cloning and protein-expression methods for high-value targets at the Seattle Structural Genomics Center for Infectious Disease, Acta Crystallogr. Sect. F Struct. Biol. Cryst. Commun. 67 (2011) 992–997.
[4] S. Gräslund, J. Sagemark, H. Berglund, L.-G. Dahlgren, A. Flores, M. Hammarström, et al., The use of systematic N- and C-terminal deletions to promote production and structural studies of recombinant proteins, Protein Expr. Purif. 58 (2008) 210–221.
[5] C. Bignon, C. Li, J. Lichière, B. Canard, B. Coutard, Improving the soluble expression of recombinant proteins by randomly shuffling 5′ and 3′ coding-sequence ends, Acta Crystallogr. D Biol. Crystallogr. 69 (2013) 2580–2582.
[6] R. Xiao, S. Anderson, J. Aramini, R. Belote, W.A. Buchwald, C. Ciccosanti, et al., The high-throughput protein sample production platform of the Northeast Structural Genomics Consortium, J. Struct. Biol. 172 (2010) 21–33.
[7] D.J. Hart, G.S. Waldo, Library methods for structural biology of challenging proteins and their complexes, Curr. Opin. Struct. Biol. 23 (2013) 403–408.
[8] H. Yumerefendi, D.C. Desravines, D.J. Hart, Library-based methods for identification of soluble expression constructs, Methods 55 (2011) 38–43.
[9] M.R. Dyson, Selection of soluble protein expression constructs: the experimental determination of protein domain boundaries, Biochem. Soc. Trans. 38 (2010) 908–913.
[10] C. Prodromou, R. Savva, P.C. Driscoll, DNA fragmentation-based combinatorial approaches to soluble protein expression, Drug Discov. Today 12 (2007) 931–938.
[11] R. Savva, C. Prodromou, P.C. Driscoll, DNA fragmentation based combinatorial approaches to soluble protein expression. Part II. Library expression, screening and scale-up, Drug Discov. Today 12 (2007) 939–947.
[12] H. Yumerefendi, F. Tarendeau, P.J. Mas, D.J. Hart, ESPRIT: an automated, library-based method for mapping and soluble expression of protein domains from challenging targets, J. Struct. Biol. 172 (2010) 66–74.
[13] T. Cornvik, S.-L. Dahlroth, A. Magnusdottir, M.D. Herman, R. Knaust, M. Ekberg, et al., Colony filtration blot: a new screening method for soluble protein expression in Escherichia coli, Nat. Methods 2 (2005) 507–509.
[14] J.-D. Pedelacq, H.B. Nguyen, S. Cabantous, B.L. Mark, P. Listwan, C. Bell, et al., Experimental mapping of soluble protein domains using a hierarchical approach, Nucleic Acids Res. 39 (2011) e125.
[15] T. Cornvik, S.-L. Dahlroth, A. Magnusdottir, S. Flodin, B. Engvall, V. Lieu, et al., An efficient and generic strategy for producing soluble human proteins and domains in E. coli by screening construct libraries, Proteins 65 (2006) 266–273.
[16] K. Rottier, A. Faille, T. Prudhomme, C. Leblanc, C. Chalut, S. Cabantous, et al., Detection of soluble co-factor dependent protein expression in vivo: application to the 4′-phosphopantetheinyl transferase PptT from Mycobacterium tuberculosis, J. Struct. Biol. 183 (2013) 320–328.
[17] M. Kawasaki, F. Inagaki, Random PCR-based screening for soluble domains using green fluorescent protein, Biochem. Biophys. Res. Commun. 280 (2001) 842–844.
[18] S. Reich, L.H. Puckey, C.L. Cheetham, R. Harris, A.A.E. Ali, U. Bhattacharyya, et al., Combinatorial Domain Hunting: an effective approach for the identification of soluble protein domains adaptable to high-throughput applications, Protein Sci. 15 (2006) 2356–2365.
[19] M.R. Dyson, R.L. Perera, S.P. Shadbolt, L. Biderman, K. Bromek, N.V. Murzina, et al., Identification of soluble protein fragments by gene fragmentation and genetic selection, Nucleic Acids Res. 36 (2008) e51.
[20] M.L. Gerth, W.M. Patrick, S. Lutz, A second-generation system for unbiased reading frame selection, Protein Eng. Des. Sel. 17 (2004) 595–602.
[21] Y. An, H. Yumerefendi, P.J. Mas, A. Chesneau, D.J. Hart, ORF-selector ESPRIT: a second generation library screen for soluble protein expression employing precise open reading frame selection, J. Struct. Biol. 175 (2011) 189–197.
[22] S. Cabantous, T.C. Terwilliger, G.S. Waldo, Protein tagging and detection with engineered self-assembling fragments of green fluorescent protein, Nat. Biotechnol. 23 (2005) 102–107.
[23] S. Cabantous, G.S. Waldo, In vivo and in vitro protein solubility assays using split GFP, Nat. Methods 3 (2006) 845–854.
[24] S. Koyama, H. Yu, D.C. Dalgarno, T.B. Shin, L.D. Zydowsky, S.L. Schreiber, Structure of the PI3K SH3 domain and analysis of the SH3 family, Cell 72 (1993) 945–952.
[25] J. Liang, J.K. Chen, S.T. Schreiber, J. Clardy, Crystal structure of P13K SH3 domain at 2.0 angstroms resolution, J. Mol. Biol. 257 (1996) 632–643.
[26] A. Musacchio, L.C. Cantley, S.C. Harrison, Crystal structure of the breakpoint cluster region-homology domain from phosphoinositide 3-kinase p85 alpha subunit, Proc. Natl. Acad. Sci. U.S.A. 93 (1996) 14373–14378.
[27] R.T. Nolte, M.J. Eck, J. Schlessinger, S.E. Shoelson, S.C. Harrison, Crystal structure of the PI 3-kinase p85 amino-terminal SH2 domain and its phosphopeptide complexes, Nat. Struct. Mol. Biol. 3 (1996) 364–374.
[28] N. Miled, Y. Yan, W.-C. Hon, O. Perisic, M. Zvelebil, Y. Inbar, et al., Mechanism of two classes of cancer mutations in the phosphoinositide 3-kinase catalytic subunit, Science 317 (2007) 239–242.
[29] A.L. Breeze, B.V. Kara, D.G. Barratt, M. Anderson, J.C. Smith, R.W. Luke, et al., Structure of a specific peptide complex of the carboxy-terminal SH2 domain from the p85 alpha subunit of phosphatidylinositol 3-kinase, EMBO J. 15 (1996) 3579–3589.
[30] G.W. Booker, I. Gout, A.K. Downing, P.C. Driscoll, J. Boyd, M.D. Waterfield, et al., Solution structure and ligand-binding site of the SH3 domain of the p85 alpha subunit of phosphatidylinositol 3-kinase, Cell 73 (1993) 813–822.
[31] G.W. Booker, A.L. Breeze, A.K. Downing, G. Panayotou, I. Gout, M.D. Waterfield, et al., Structure of an SH2 domain of the p85 alpha subunit of phosphatidylinositol-3-OH kinase, Nature 358 (1992) 684–687.
[32] G. Siegal, B. Davis, S.M. Kristensen, A. Sankar, J. Linacre, R.C. Stein, et al., Solution structure of the C-terminal SH2 domain of the p85 alpha regulatory subunit of phosphoinositide 3-kinase, J. Mol. Biol. 276 (1998) 461–478.
[33] F.J. Hoedemaeker, G. Siegal, S.M. Roe, P.C. Driscoll, J.P. Abrahams, Crystal structure of the C-terminal SH2 domain of the p85alpha regulatory subunit of phosphoinositide 3-kinase: an SH2 domain mimicking its own substrate, J. Mol. Biol. 292 (1999) 763–770.
[34] F.W. Studier, Use of bacteriophage T7 lysozyme to improve an inducible T7 expression system, J. Mol. Biol. 219 (1991) 37–44.
[35] W.-C. Tseng, J.-W. Lin, T.-Y. Wei, T.-Y. Fang, A novel megaprimed and ligase-free, PCR-based, site-directed mutagenesis method, Anal. Biochem. 375 (2008) 376–378.
[36] G. Winberg, M.L. Hammarskjöld, Isolation of DNA from agarose gels using DEAE-paper. Application to restriction site mapping of adenovirus type 16 DNA, Nucleic Acids Res. 8 (1980) 253–264.
[37] J.C. Dunn, A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters, J. Cybernet. 3 (1973) 32–57.
[38] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum, New York, NY, 1981.
[39] R.M.C.R. de Souza, F.A.T. de Carvalho, Clustering of interval data based on city-block distances, Pattern Recognit. Lett. 25 (2004) 353–365.
[40] F.A.T. de Carvalho, P. Brito, H.-H. Bock, Dynamic clustering for interval data based on L2 distance, Comput. Stat. 21 (2006) 231–250.
[41] M.A. Lockard, P. Listwan, J.D. Pedelacq, S. Cabantous, H.B. Nguyen, T.C. Terwilliger, et al., A high-throughput immobilized bead screen for stable proteins and multi-protein complexes, Protein Eng. Des. Sel. 24 (2011) 565–578.
[42] B. Oliva, G. Gordon, P. McNicholas, G. Ellestad, I. Chopra, Evidence that tetracycline analogs whose primary target is not the bacterial ribosome cause lysis of Escherichia coli, Antimicrob. Agents Chemother. 36 (1992) 913–919.
[43] L.M. Guzman, D. Belin, M.J. Carson, J. Beckwith, Tight regulation, modulation, and high-level expression by vectors containing the arabinose PBAD promoter, J. Bacteriol. 177 (1995) 4121–4130.
[44] D.A. Parsell, R.T. Sauer, The structural stability of a protein is an important determinant of its proteolytic susceptibility in Escherichia coli, J. Biol. Chem. 264 (1989) 7590–7595.
[45] X. Cheng, X. Zhang, J.W. Pflugrath, F.W. Studier, The structure of bacteriophage T7 lysozyme, a zinc amidase and an inhibitor of T7 RNA polymerase, Proc. Natl. Acad. Sci. U.S.A. 91 (1994) 4034–4038.
[46] D. Jeruzalmi, T.A. Steitz, Structure of T7 RNA polymerase complexed to the transcriptional inhibitor T7 lysozyme, EMBO J. 17 (1998) 4101–4113.
[47] J. Sambrook, D.W. Russell, Fragmentation of DNA by Nebulization, Cold Spring Harbor Protocols, 2006.
[48] S.O. Garbuzynskiy, D.N. Ivankov, N.S. Bogatyreva, A.V. Finkelstein, Golden triangle for folding rates of globular proteins, Proc. Natl. Acad. Sci. U.S.A. 110 (2013) 147–150.