GENETIC ANALYSIS OF STRUCTURE-FUNCTION

[27] Random Mutagenesis of Protein Sequences Using Oligonucleotide Cassettes

By JOHN F. REIDHAAR-OLSON, JAMES U. BOWIE, RICHARD M. BREYER, JAMES C. HU, KENDALL L. KNIGHT, WENDELL A. LIM, MICHAEL C. MOSSING, DAWN A. PARSELL, KEVIN R. SHOEMAKER, and ROBERT T. SAUER
METHODS IN ENZYMOLOGY, VOL. 208
Copyright © 1991 by Academic Press, Inc. All rights of reproduction in any form reserved.

Introduction

Investigations of protein structure and function often rely on the analysis of mutant proteins. With the advent of methods for rapid and economical chemical synthesis of DNA, there has been a steadily increasing use of oligonucleotide-directed mutagenesis¹ and cassette mutagenesis²,³ to create specific mutations at particular sites. These synthetic methods can also be used to create random mutations at one or more codons in a gene. In this chapter, we describe and evaluate several methods of cassette mutagenesis that allow random mutations to be generated, and discuss the application of these methods to the analysis of the structure and function of DNA-binding proteins. In addition to the techniques described here, other methods are available for the efficient generation of protein variants. These include approaches based on the suppression of amber mutations⁴ and oligonucleotide-directed methods that generate either random changes⁵,⁶ or specific changes.⁷

A simple form of cassette-mediated random mutagenesis involves mutating a single codon to encode all 20 naturally occurring amino acids. Individual clones are then isolated from the resulting population of mutants, sequenced, and screened for neutral, conditional, or defective phenotypes. This method can be somewhat labor intensive, but because changes are limited to a single position, the interpretation of the resulting phenotypes is relatively straightforward. As described below, this approach is easily extended to the simultaneous mutagenesis of several codons, either in contiguous blocks or in noncontiguous residue positions. In either of these multiple mutagenesis experiments, a biological selection is generally applied to identify active clones, which are then sequenced. Such experiments generate lists of functional sequences, from which one can determine the spectrum of substitutions that are tolerated. An analysis of the resulting substitution patterns is then used to determine the importance of the mutagenized positions and to check for possible combinatorial effects.

All cassette mutagenesis procedures involve the synthesis of a small, double-stranded DNA molecule that can be ligated into a larger vector fragment to reconstruct the gene of interest (Fig. 1). The cassette is created by chemical synthesis or a combination of chemical and enzymatic synthesis. The backbone molecule is generated by digestion of plasmid or viral DNA with restriction enzymes. Introducing a cassette into a particular region of a gene requires that appropriate restriction sites occur in the gene sequence. If such sites are not present in the wild-type gene, it is usually possible to modify the sequence in such a way that restriction sites are introduced every 30-40 bases without altering the encoded protein sequence.⁸,⁹

In the following sections, we discuss mutagenic strategies and issues that arise in the interpretation of results from random mutagenesis experiments. We then describe techniques for preparation of oligonucleotide cassettes, for preparation of plasmid backbones, for transformation, and for rapid sequencing of mutant genes. Finally, we discuss problems that can arise in cassette mutagenesis experiments.

¹ M. J. Zoller and M. Smith, this series, Vol. 100, p. 468.
² S. J. Eisenbeis, M. S. Nasoff, S. A. Noble, L. P. Bracco, D. R. Dodds, and M. H. Caruthers, Proc. Natl. Acad. Sci. U.S.A. 82, 1084 (1985).
³ J. A. Wells, M. Vasser, and D. B. Powers, Gene 34, 315 (1985).
⁴ J. H. Miller, this volume [26].
⁵ D. D. Loeb, R. Swanstrom, L. Everitt, M. Manchester, S. E. Stamper, and C. A. I. Hutchison, Nature (London) 340, 397 (1989).
⁶ J. D. Hermes, S. C. Blacklow, and J. R. Knowles, Proc. Natl. Acad. Sci. U.S.A. 87, 696 (1990).
⁷ B. C. Cunningham and J. A. Wells, Science 244, 1081 (1989).
⁸ M. Nassal, T. Mogi, S. S. Karnik, and H. G. Khorana, J. Biol. Chem. 262, 9264 (1987).
⁹ A number of computer programs are available that identify sites in gene sequences at which recognition sites for restriction endonucleases may be introduced without altering the protein sequence. One such program, SeqSearcher for the Apple Macintosh computer, is available from Jim Hu, Dept. of Biology, Room 16-833, Massachusetts Institute of Technology, Cambridge, MA 02139.

Mutagenic Strategies

Genetic Selections and Screens. Efficient functional selections or screens are extremely important in most randomization experiments, especially when multiple codons are mutagenized. For sequence-specific DNA-binding proteins that can be expressed in Escherichia coli, it is often possible to develop selections in vivo based on antibiotic resistance.¹⁰,¹¹

¹⁰ M. C. Mossing, J. U. Bowie, and R. T. Sauer, this volume [29].
¹¹ S. J. Elledge, P. Sugiono, L. Guarente, and R. W. Davis, Proc. Natl. Acad. Sci. U.S.A. 86, 3689 (1989).
          T           C           A           G

T     TTC Phe     TCC Ser     TAC Tyr     TGC Cys
      TTG Leu     TCG Ser     TAG Stop    TGG Trp

C     CTC Leu     CCC Pro     CAC His     CGC Arg
      CTG Leu     CCG Pro     CAG Gln     CGG Arg

A     ATC Ile     ACC Thr     AAC Asn     AGC Ser
      ATG Met     ACG Thr     AAG Lys     AGG Arg

G     GTC Val     GCC Ala     GAC Asp     GGC Gly
      GTG Val     GCG Ala     GAG Glu     GGG Gly

FIG. 2. Genetic code with third codon position restricted to G or C. Rows give the first codon position and columns the second. All of the codons shown are accessible in an NN(G/C) randomization. Further restrictions are also possible. For example, an NT(G/C) randomization would restrict the available codons to the first column.
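The codon set of Fig. 2 can be checked by direct enumeration. The following Python sketch is not part of the original chapter: the names `codons_for` and `residue_counts` are illustrative, one-letter amino acid codes are used for brevity, and `*` denotes a stop codon. It lists the codons accessible in a given randomization scheme and counts the codons available to each residue.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code in one-letter form ("*" = stop), listed with the
# first codon position varying slowest and the third fastest (T, C, A, G).
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_for(first="TCAG", second="TCAG", third="CG"):
    """Return {codon: residue} for the bases allowed at each codon position.

    The defaults correspond to an NN(G/C) randomization.
    """
    return {
        a + b + c: CODON_TABLE[a + b + c]
        for a in first
        for b in second
        for c in third
    }


def residue_counts(codons):
    """Count how many of the accessible codons encode each residue."""
    return Counter(codons.values())
```

Under these assumptions, `codons_for()` yields 32 codons that encode all 20 residues plus the TAG stop codon, with three codons each for Leu, Arg, and Ser and one each for Met and Trp. Calling `codons_for(second="T")` reproduces the first column of Fig. 2.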
Colony screens based on repression of lacZ or galK expression are also straightforward in many cases. In interpreting phenotypes based on any selection or screen, it is important to know what level of activity is required. If the stringency of the selection or screen is very high, then small reductions in activity or protein level may result in a conditional or even defective phenotype, and fewer residue substitutions will be tolerated. If the stringency is lower, then more substitutions would be expected to be classified as functionally neutral. In our experience, a selection that requires approximately 10% of wild-type activity provides a useful metric that allows the importance of residue positions to be established.

Codon Restrictions. In codon randomization experiments, the position(s) of interest can be changed to either a limited set or a complete set of the 20 naturally occurring amino acids. Complete randomization at the codon level can be achieved by an NNN randomization (where N indicates a mixture of A, C, G, and T). However, as shown in Fig. 2, the third codon position can be restricted to G or C without eliminating any of the amino acids, and thus an NN(G/C) randomization is sufficient to achieve complete randomization with respect to possible amino acid sequences. This restriction is useful both because it reduces the overall DNA sequence complexity and because it reduces the coding discrepancy between residues like Met and Trp, which have a single codon, and residues like Leu, Arg, and Ser, which have six codons in an NNN randomization but only three codons in an NN(G/C) randomization. Randomizations can also be restricted to specific regions of the genetic code. For example, a randomization using
NT(G/C) would be restricted to the five hydrophobic amino acids Phe, Leu, Ile, Met, and Val, whereas an NA(G/C) randomization would be restricted to the seven relatively hydrophilic side chains Tyr, His, Gln, Asn, Lys, Asp, and Glu. In either complete or limited randomizations of this type, equal quantities of each desired base are generally included. Thus in unselected populations, one would expect to recover each residue at a frequency roughly proportional to the number of codons for that residue. However, random mutagenesis can also be performed using unequal quantities of each base,¹²,¹³ in order to bias recoveries toward a specific residue, which is often the wild type. For example, a "biased" randomization might include 70% of the wild-type base at each codon position and 10% of each of the other bases.

Single-Codon Randomizations. Mutagenesis of single-codon positions is the most basic kind of cassette randomization experiment. In general, one uses an unbiased NN(G/C) randomization to generate all of the possible amino acids and then analyzes the sequences and phenotypes of clones from the resulting population of variants. A good mixture of different sequences can generally be obtained simply by sequencing unselected clones. The phenotypes conferred by each different sequence can then be determined. Alternatively, it may be desirable in some cases to screen clones for phenotype first and then to sequence several candidates from each phenotypic class.

How many unselected clones from an NN(G/C) randomization need to be sequenced to recover a given number of different amino acids? Monte Carlo simulations (assuming a random distribution at the nucleotide level) show that sequencing 30 candidates generally results in the identification of 14-16 residues, whereas sequencing almost 100 candidates is necessary to have a greater than 50% chance of recovering all 20 amino acids.¹⁴ For most studies, there is no need to analyze all possible substitutions; enough different changes are recovered by sequencing 20-30 candidates to evaluate the importance of the residue being studied. When recovery of a complete set of substitutions is desirable, it is often easiest to sequence approximately 50 clones, and then make the remaining 1 or 2 substitutions by conventional site-directed mutagenesis using a cassette.

¹² A. R. Oliphant, A. L. Nussbaum, and K. Struhl, Gene 44, 177 (1986).
¹³ D. E. Hill, A. R. Oliphant, and K. Struhl, this series, Vol. 155, p. 558.
¹⁴ J. F. Reidhaar-Olson and R. T. Sauer, Proteins: Struct. Funct. Genet. 7, 306 (1990).

Randomization of Multiple Codons. Simultaneous randomization of more than one codon is useful because many positions can be examined in a single experiment. Moreover, potential interactions among residues can be tested. However, in randomization experiments involving multiple
codons, it is rarely desirable to try to recover or analyze all of the resulting sequences. This is true both because of numerical complexity and because analysis of multiply mutant sequences that confer a defective phenotype is generally uninformative. As a consequence, the analysis is usually limited to active sequences that are identified by using a biological selection for activity. Screening among unselected candidates can also be used to find active clones, but can be tedious if only a small number of the randomly generated sequences are active.

In multiple codon randomizations, it is often desirable to limit the number of codons being mutated or the extent of randomization to ensure that some active sequences are recovered. If an NN(G/C) randomization is applied to 5 codons, then 20⁵ or 3,200,000 different amino acid sequences are generated. The complexity at the level of nucleic acid sequences is even greater. In principle, with a sufficiently powerful selection, one could recover any active sequence (including wild type). In practice, however, the total library of transformants may be smaller (e.g., 10⁵-10⁶) than the sequence complexity and thus there is no guarantee that any active sequences will be recovered. By contrast, if the NN(G/C) randomization were restricted to 3 codons, then the amino acid sequence complexity would be only 8000, and there would be an excellent chance of recovering all active sequences from the mutagenized library. If the goal of a study is a comprehensive survey of all sequences, then it is probably best to limit complete randomizations to three or at most four codons.

Sequence complexity in multiple codon randomizations can also be limited by restricting the randomization to a subset of the amino acids or by biasing the randomization toward the wild-type residue (see Codon Restrictions, above).
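The comparison between sequence complexity and library size can be made concrete with a short calculation. The sketch below is illustrative only and is not taken from the original chapter: the function names are hypothetical, transformants are assumed to be independent and drawn uniformly from the 32 NN(G/C) codons at each position, and the target sequence is assumed by default to be encoded by a single codon at each position (the least favorable case).

```python
import math


def sequence_complexity(n_codons, residues_per_codon=20):
    """Number of distinct amino acid sequences for n randomized codons."""
    return residues_per_codon ** n_codons


def codon_complexity(n_codons, codons_per_position=32):
    """Number of distinct DNA sequences for n NN(G/C) codons."""
    return codons_per_position ** n_codons


def prob_sequence_present(n_transformants, n_codons,
                          codons_encoding=1, codons_per_position=32):
    """Probability that one particular amino acid sequence occurs at least
    once among independent transformants.

    Assumption: each residue of the target sequence is encoded by
    `codons_encoding` of the `codons_per_position` codons at every position.
    """
    p = (codons_encoding / codons_per_position) ** n_codons
    # 1 - (1 - p)^N, written with log1p/expm1 for accuracy when p is small.
    return -math.expm1(n_transformants * math.log1p(-p))
```

Under these assumptions, with 10⁵ transformants a given three-codon sequence is present with a probability above 0.95, whereas a given five-codon sequence is present with a probability below 0.01.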
Restricting the randomization is useful if the subset of residues introduced at each position (these subsets can be different for different codons) provides useful information. For example, in one study of hydrophobic packing in the core of λ repressor, a randomization of 3 positions was restricted to 5 hydrophobic residues, allowing a comprehensive survey of the 125 possible residue combinations.¹⁵

¹⁵ W. A. Lim and R. T. Sauer, J. Mol. Biol. 219, 359 (1991).
¹⁶ J. U. Bowie and R. T. Sauer, Proc. Natl. Acad. Sci. U.S.A. 86, 2152 (1989).

In biased randomizations, the extent of mutagenesis can be controlled by adjusting the level of mutant bases. For example, in studies of the Arc repressor,¹⁶ blocks of approximately 10 adjacent codons were mutagenized by contaminating the wild-type nucleotide at each position with 7.5% of each of the other 3 bases. At this level of mutagenesis, each codon has about a 60% chance of encoding the wild-type residue and about a 40% chance of encoding a mutant amino acid. Hence, the mutagenized population should contain
[Plot; horizontal axis: Number of residues allowed (2 to 20); vertical axis scale: 0 to 160.]

FIG. 3. Number of sequences required to observe all allowed residues. Monte Carlo simulations of NN(G/C) random mutagenesis experiments were performed to find the number of sequences that need to be determined to observe all allowed residues 50% of the time (+), 80% of the time (○), and 95% of the time (●). For example, if one determines 50 sequences and observes 10 different residues at a randomized position, then one knows with about 80% confidence that all allowed residues have been recovered. With 10 observed residues, about 75 sequences would be required to increase the confidence level to 95%.
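A simulation of the kind summarized in Fig. 3 can be sketched in a few lines of Python. This is not the authors' program, and the names are hypothetical. It assumes that every functional clone encodes one of the allowed residues, and that each allowed residue is recovered in proportion to its number of NN(G/C) codons. Because the original simulations may have been set up differently, the numbers it returns should be taken as indicative only.

```python
import math
import random

# Number of NN(G/C) codons encoding each residue (one-letter codes).
NNS_CODONS = {
    "L": 3, "R": 3, "S": 3,
    "A": 2, "G": 2, "P": 2, "T": 2, "V": 2,
    "C": 1, "D": 1, "E": 1, "F": 1, "H": 1, "I": 1,
    "K": 1, "M": 1, "N": 1, "Q": 1, "W": 1, "Y": 1,
}


def draws_until_complete(allowed, rng):
    """Sequence clones one at a time until every allowed residue is seen."""
    residues = list(allowed)
    weights = [NNS_CODONS[r] for r in residues]
    seen = set()
    draws = 0
    while len(seen) < len(residues):
        seen.add(rng.choices(residues, weights=weights)[0])
        draws += 1
    return draws


def sequences_needed(allowed, confidence, trials=2000, seed=0):
    """Estimate how many sequences must be determined so that all allowed
    residues are observed in at least `confidence` of simulated experiments."""
    rng = random.Random(seed)
    counts = sorted(draws_until_complete(allowed, rng) for _ in range(trials))
    index = min(trials - 1, max(0, math.ceil(confidence * trials) - 1))
    return counts[index]
```

For example, `sequences_needed("FLIMV", 0.95)` estimates the number of functional sequences needed to observe all five residues of a hypothetical hydrophobic-only position at the 95% level.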
occasional wild-type sequences (at a frequency of approximately 0.6%) but most genes should contain an average of four amino acid substitutions in the mutagenized region. Substitutions resulting from "biased" randomizations will be somewhat limited because single base changes are more probable than double changes, and triple changes are extremely rare. In practice, however, enough substitutions can generally be recovered to assess the general requirements for function.

In analyzing functional variants, how does one decide that enough sequences have been determined to be confident that all tolerated substitutions have been identified? As shown in Fig. 3, the number of sequences that need to be determined depends both on the number of residues allowed and the desired confidence limits. Nevertheless, several points are clear. For example, if 20 sequences are determined and only 1 or 2 residues are recovered at a given position, then it is quite likely that other residues would be nonfunctional. However, if in the set of 20 sequences, 10 different residues are recovered, then sequencing more candidates is likely to result
in additional allowed residues. In general, if a large number of different side chains are observed to be tolerated at a given position, then it is probably dangerous to assume that the residues that have not been recovered are nonfunctional. However, it is generally not necessary to know with certainty that all allowed changes have been recovered. The simple fact that a large number of chemically dissimilar side chains allow function suggests that the position under study cannot be critical for function.

In any experiment where more than one codon is randomized, it is possible that some of the functional substitutions recovered at one position depend on simultaneous changes at other positions in the sequence. In such cases, the observed substitutions might not be tolerated as single substitutions in otherwise wild-type backgrounds. Such examples are most common in the case of interacting residues. For example, when three or four interacting residues in the hydrophobic core of the N-terminal domain of λ repressor were randomized together, residue substitutions were found to be allowed that were known not to be tolerated as single substitutions.¹⁷
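The figures quoted earlier in this section for the "biased" randomization of Arc repressor follow from simple probability. These are about a 60% chance per codon of retaining the wild-type residue, about 0.6% wild-type sequences, and about four substitutions per 10-codon block. The Python sketch below is an illustration under stated assumptions rather than the authors' calculation: the function names are hypothetical, each base position is assumed to be doped independently with the same fraction of each non-wild-type base, and the wild-type codons must be supplied by the user.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one-letter residues ("*" = stop), TCAG ordering.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def prob_wild_type_residue(wt_codon, contamination=0.075):
    """Probability that a doped codon still encodes the wild-type residue.

    `contamination` is the fraction of each of the three non-wild-type
    bases at every position (7.5% in the example from the text).
    """
    keep = 1.0 - 3.0 * contamination
    wt_residue = CODON_TABLE[wt_codon]
    total = 0.0
    for codon, residue in CODON_TABLE.items():
        if residue != wt_residue:
            continue
        p = 1.0
        for wt_base, base in zip(wt_codon, codon):
            p *= keep if base == wt_base else contamination
        total += p
    return total


def block_statistics(wt_codons, contamination=0.075):
    """Return (fraction of fully wild-type sequences,
    expected number of amino acid substitutions) for a block of codons."""
    p_all_wild_type = 1.0
    expected_substitutions = 0.0
    for codon in wt_codons:
        p = prob_wild_type_residue(codon, contamination)
        p_all_wild_type *= p
        expected_substitutions += 1.0 - p
    return p_all_wild_type, expected_substitutions
```

For a four-fold degenerate codon such as GCC (Ala), the per-codon probability is 0.775² ≈ 0.60. A block of 10 such codons is therefore entirely wild type at a frequency of about 0.6% and carries about four substitutions on average, consistent with the values quoted in the text. A codon with a single-codon residue, such as ATG, retains the wild-type residue less often (0.775³ ≈ 0.47).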
Applications and Analysis
Identification of Allowed Substitutions. Cassette randomization experiments are extremely useful in identifying the importance of residues in a protein sequence. Because the method of mutagenesis ensures that a complete ensemble of sequences is present in the initial randomized population, many different substitutions should be allowed if the chemical identity of the side chain is not important for function. In contrast, if only the wild-type residue or a small set of related side chains is recovered among the functional sequences, then it is reasonable to infer that most other substitutions would result in a defective phenotype. Figure 4 shows residue substitutions that are functionally allowed in the helix 1 region of the N-terminal domain of λ repressor.¹⁴ Some positions tolerate many chemically dissimilar side chains, suggesting that these positions play little or no role in either structure or function. Other positions are restricted to a single residue or a small set of chemically similar side chains.

¹⁷ W. A. Lim and R. T. Sauer, Nature (London) 339, 31 (1989).
¹⁸ S. R. Jordan and C. O. Pabo, Science 242, 893 (1988).

In the crystal structure of the N-terminal domain bound to operator DNA,¹⁸ the ε-amino group of Lys-19 and the hydroxyl of Tyr-22 make direct contacts with the phosphate backbone of the operator, indicating a direct role of these side chains in DNA binding. In addition to its role in DNA binding, the ring portion of Tyr-22 forms part of the hydrophobic core of the protein and thus serves a structural role as well. Ala-15,
[Figure 4: allowed residues are listed above the wild-type sequence Glu13-Asp14-Ala15-Arg16-Arg17-Leu18-Lys19-Ala20-Ile21-Tyr22-Glu23. Surface (s) or buried (b), positions 13-23: s b b s s b s s b b s. Marked interactions: salt bridge (Asp-14 and Arg-17); DNA contact (Lys-19); DNA contact (Tyr-22).]

FIG. 4. Allowed substitutions identified from random mutagenesis experiments. The allowed residues in a helical region of the N-terminal domain of λ repressor are shown above the wild-type sequence. The numbers below the wild-type sequence indicate residue numbers. An s below the sequence indicates that the wild-type residue is on the surface of the protein; a b indicates that the residue is buried (0-25% fractional side chain accessibility). The side chains of Asp-14 and Arg-17 interact through a salt bridge. Lys-19 and Tyr-22 make contacts with the operator DNA. In general, residues that are either structurally or functionally important are conserved.
Leu-18, and Ile-21 also form part of the hydrophobic core, while Asp-14 and Arg-17 form a charge-stabilized hydrogen bond. Hence, these residues are likely to be important for maintaining the folded structure of the protein.

¹⁹ M. H. Hecht, J. M. Sturtevant, and R. T. Sauer, Proc. Natl. Acad. Sci. U.S.A. 81, 5685 (1984).
²⁰ H. C. M. Nelson and R. T. Sauer, J. Mol. Biol. 192, 27 (1986).
²¹ F. D. Bushman, C. Shang, and M. Ptashne, Cell (Cambridge, Mass.) 58, 1163 (1989).

Discriminating between Structural and Functional Effects. The examples discussed above show that allowed substitutions can be restricted because the residue mediates contact with the DNA or is important for protein folding or stability. In the absence of structural information, these effects can usually be distinguished by biochemical studies of purified proteins.¹⁹,²⁰ However, it is sometimes possible to eliminate mutations that cause structural defects if more than one phenotype that depends on structure can be monitored. For example, in studies of the positive control function of λ repressor, randomized candidates were first selected for their ability to mediate repression (thereby ensuring that the mutant proteins could fold and bind DNA) and then screened for a positive control phenotype.²¹ This allowed the identification of a single residue that was critical
for positive control. As discussed below, strategies involving antibody screens or intracellular proteolysis can also be used to distinguish structural and functional effects.

Proteolysis and Structural Defects. Thermally unstable variants of many DNA-binding proteins, including the N-terminal domain of λ repressor, λ Cro, and P22 Arc, are subject to rapid degradation in E. coli.²²⁻²⁵ As a result, variants bearing destabilizing substitutions are generally present at intracellular levels lower than wild type, and the presence of normal levels of a mutant protein in vivo is often indicative of thermal stability. For example, in studies of the Arc repressor, blocks of approximately 10 codons at a time were subjected to a "biased" randomization and unselected colonies were screened by sodium dodecyl sulfate (SDS) gel electrophoresis for Arc protein levels.¹⁶ The arc genes from colonies displaying moderate to high levels of protein were sequenced, and the corresponding proteins were purified and studied by circular dichroism to confirm that they were stably folded. These studies resulted in a list of "structurally" tolerated substitutions. By comparing this list with that of functionally tolerated substitutions, it was possible to distinguish positions likely to be directly involved in DNA recognition from those likely to be involved in stabilization of structure.

Although proteolysis screens can be used in a preliminary way to identify structurally stable proteins, the absence of a mutant protein in a cell lysate does not always indicate structural instability. For example, in a study of the N-terminal domain of λ repressor, the five C-terminal codons of the gene were subjected to a complete NN(G/C) randomization and colonies were screened by SDS gels for protein levels.²⁶ Candidates found to have extremely low levels of the N-terminal domain (and to be rapidly degraded) had hydrophobic C-terminal pentapeptides.
By contrast, candidates with high levels of protein had hydrophilic C-terminal pentapeptides. While these results provided clear evidence that hydrophobic residues at the C terminus resulted in proteolytic instability, the Tm of one of the most rapidly degraded variants was found to be identical to wild type. Hence, proteolytic instability and thermal instability are not always correlated. This example illustrates that purification and biochemical analysis of mutant proteins are ultimately required to understand the effects of particular mutations.

²² D. A. Parsell and R. T. Sauer, J. Biol. Chem. 264, 7590 (1989).
²³ J. U. Bowie and R. T. Sauer, J. Biol. Chem. 264, 7596 (1989).
²⁴ A. A. Pakula, V. B. Young, and R. T. Sauer, Proc. Natl. Acad. Sci. U.S.A. 83, 8829 (1986).
²⁵ A. A. Pakula and R. T. Sauer, Proteins: Struct. Funct. Genet. 5, 202 (1989).
²⁶ D. A. Parsell, K. R. Silber, and R. T. Sauer, Genes Dev. 4, 277 (1990).
Observed Densities of Restricted Sites in DNA-Binding Proteins. In the 92-residue N-terminal domain of λ repressor, 60 residues have been studied by random mutagenesis.¹⁴,¹⁷,²¹,²⁷⁻³⁰ Of these residues, roughly one-half exhibit highly restricted substitution patterns, and thus play important roles of some kind in repressor structure and function. Many of the important residues are buried in the hydrophobic core or the dimer interface of the protein. Of the remaining restricted positions, some are involved in hydrogen bonds or salt bridges, some are directly required for DNA binding, and one is essential in protecting the protein from intracellular proteolysis. In the P22 Arc repressor, all 53 residues of the protein have been characterized in a "biased" randomization study.¹⁶ Here, the identities of approximately one-third of the residues were found to be functionally important, and one-half were structurally important. The remaining residues could be freely substituted and thus are unimportant for either structure or function.

Interplay of Single-Codon and Multiple-Codon Randomizations. It is often useful to apply single-codon and multiple-codon randomizations in a sequential fashion, as the two methods can provide different types of information. For example, in mapping antibody-binding epitopes on the surface of λ repressor, a region of the gene was subjected to a multiple-codon "biased" randomization and candidates displaying repressor activity and binding to a conformation-specific monoclonal antibody were identified.²⁸ Several surface positions were found to be invariant in the antibody-reactive clones, suggesting that the side chains at these positions played important roles in antibody binding. This was confirmed by performing single-codon NN(G/C) randomizations on each of these positions, sorting the resulting mutants into reactive and nonreactive classes, and then purifying representative mutants and measuring antibody affinities directly.
In this case, the multiple-codon randomization provided a rapid way to identify important positions, while the single-codon randomizations provided detailed information about the sequence requirements at each of these positions.

²⁷ J. F. Reidhaar-Olson and R. T. Sauer, Science 241, 53 (1988).
²⁸ R. M. Breyer and R. T. Sauer, J. Biol. Chem. 264, 13355 (1989).
²⁹ J. F. Reidhaar-Olson, D. A. Parsell, and R. T. Sauer, Biochemistry 29, 7563 (1990).
³⁰ N. D. Clarke and C. O. Pabo, manuscript in preparation.

Multiple-codon randomizations can also be performed after single-codon studies, as a test of the additivity of allowed substitutions. For example, in studies of leucine zipper function, single-codon randomizations were first performed to identify allowed substitutions at four of the
conserved leucine positions.³¹ In these experiments, other hydrophobic residues such as Val, Ile, and Met were found to be tolerated at each of the four positions. However, when all four of these leucine positions were randomized simultaneously, it was found that each of the functional sequences contained at least two and usually three leucines, suggesting that many of the substitutions that were tolerated singly could not be tolerated together. In this case, the multiple-codon randomization revealed combinatorial effects that could not be identified in the single-codon randomizations.

Random Mutagenesis Techniques
Synthesis of First Strand of Cassette

Oligonucleotides can be conveniently prepared using any of a number of commercially available DNA synthesizers. On some synthesizers, the machine can be programmed to deliver variable amounts of any of the four bases during any coupling step. If this is not the case, then it is generally necessary to have additional bottles with mixtures of bases (e.g., equal amounts of A, C, G, and T). These bottles may then be attached to extra delivery ports on the synthesizer, if available. If the machine has capacity for only four bottles of nucleotide solutions, it will be necessary to interrupt the synthesis and replace one of the bottles with the base mixture at the appropriate step of the synthesis. Following removal of the final product from the solid support, the oligonucleotide may be of sufficient purity to use directly. However, gel purification on a 12-20% (w/v) polyacrylamide/urea gel³² is often desirable in order to remove incomplete DNA fragments that accumulate during the course of the synthesis. Following electrophoresis, the appropriate band is located by UV shadowing or staining with ethidium bromide, and excised. The oligonucleotide is then eluted from the gel slice and can be further purified (e.g., by passage over a reversed-phase C18 Sep-Pak column from Waters Associates, Milford, MA).
³¹ J. C. Hu, E. K. O'Shea, P. S. Kim, and R. T. Sauer, Science 250, 1400 (1990).
³² F. M. Ausubel, R. Brent, R. E. Kingston, D. D. Moore, J. G. Seidman, J. A. Smith, and K. Struhl, "Current Protocols in Molecular Biology." Wiley, New York, 1989.

Chemical Synthesis of the Complementary Strand

Figure 5A and B shows two possibilities for chemical synthesis of the second strand of the cassette. In the first, the same NN(G/C) random mixture is included in both strands at the codon or codons being randomized (Fig. 5A). In theory, mismatches generated at the randomized codon could lead
[Figure 5 panels: A. Random bases on both strands. B. Inosine pairing. C. Enzymatic second-strand synthesis: anneal, extend with DNA polymerase, digest with restriction enzymes A and B.]
to inefficient pairing or problems with heterogeneity following transformation. However, in practice this method appears to be completely satisfactory, in both efficiency and randomness, when codons are randomized individually. We have not tested the technique extensively with multiple randomizations. An alternative method (Fig. 5B) uses inosine in the second strand opposite the randomized positions in the first strand. Since inosine is able to base pair with each of the four standard bases,³³ such cassettes are able to anneal and ligate with high efficiency. As discussed later, this procedure occasionally introduces some bias toward particular bases, but appears to be sufficiently random in most instances. This technique has been used successfully to randomize as many as three residue positions simultaneously.²⁷

Procedure. Following synthesis and purification, each strand of the cassette is diluted to a concentration of 0.5 µM in 40 µl kinase buffer [50 mM Tris-HCl (pH 7.5), 10 mM MgCl2, 5 mM dithiothreitol, 0.1 mM spermidine, 0.1 mM EDTA, 2 µM ATP], and phosphorylated using 20 units of T4 polynucleotide kinase (New England Biolabs, Beverly, MA). Phosphorylation is carried out for 1.5 hr at 37°, followed by 10 min at 65° to inactivate the kinase. (As discussed later, phosphorylation of the oligonucleotides is particularly important when using the inosine method.) The two oligonucleotides are then annealed by mixing at 0.2 µM in 40 µl annealing buffer [10 mM Tris-HCl (pH 8.0), 10 mM MgCl2], heating at 80° for 10 min, and cooling slowly to room temperature.
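The inosine design of Fig. 5B can be written down mechanically once the first strand is specified. The following Python sketch is illustrative and not part of the original protocol: the function name is hypothetical, `N` and `S` denote the randomized positions (S standing for the G/C mixture), and `I` denotes inosine. It also assumes that only the base-paired region of the cassette is supplied, so single-stranded overhangs at the restriction-site ends must be handled separately.

```python
# Watson-Crick partners for defined bases; randomized positions
# (N = any base, S = G or C) are paired with inosine, written "I".
PARTNER = {"A": "T", "T": "A", "G": "C", "C": "G", "N": "I", "S": "I"}


def inosine_second_strand(first_strand):
    """Return the second strand, written 5'->3', for the paired region of a
    cassette whose first strand (5'->3') may contain N or S positions."""
    return "".join(PARTNER[base] for base in reversed(first_strand.upper()))
```

For a hypothetical first strand `ACGNNSTTG`, the function returns `CAAIIICGT`: the flanking bases are complemented normally and the three randomized positions are paired with inosine.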
³³ F. H. Martin, M. M. Castro, F. Aboul-ela, and I. Tinoco, Jr., Nucleic Acids Res. 13, 8927 (1985).

Enzymatic Second Strand Synthesis

When both strands of the mutagenic cassette are prepared by chemical synthesis, there is always the potential problem of noncomplementarity due to mismatch formation during the annealing step. This can be avoided
FIG. 5. Three strategies for construction of the complementary strand of a mutagenic cassette. (A) Random bases are included on both strands of the cassette during synthesis. (B) Inosines are included on the complementary strand opposite the randomized positions on the first strand. (C) The complementary strand is synthesized enzymatically to prevent mismatches in the cassette. In this case, the first strand is synthesized with a self-complementary region at its 3' end corresponding to restriction site B. Following extension with DNA polymerase, the double-stranded molecule is digested with restriction enzymes A and B to yield the final cassette. An italicized N indicates an enzymatically inserted base that is complementary to the random base on the opposite strand. Throughout this figure, randomized codons are indicated as NNN for simplicity; in practice, complete randomization may be accomplished by using NN(G/C).
by performing enzymatic second strand synthesis, using the first, randomized strand as a template as shown in Fig. 5C.12,13 The first strand is synthesized with a self-complementary 3' end that contains the recognition sequence for one of the two restriction enzymes. The recognition sequence for the other restriction enzyme is included near the 5' end. Extension of the oligonucleotide is performed using the Klenow fragment of DNA polymerase I. The result of the self-primed DNA synthesis is a double-stranded oligonucleotide that contains no mismatches. Digestion with the two restriction enzymes yields a cassette with the appropriate ends for insertion into the plasmid backbone. Efficient digestion requires the presence of several additional base pairs beyond the restriction sites at the ends of the cassette; for most enzymes, three additional bases at the 5' end of the oligonucleotide appear to be sufficient.
The enzymatic method is more involved than either of the chemical methods for second strand synthesis, and can lead to higher levels of unwanted mutations at cassette positions that have not been mutagenized (see Bonus Mutations, below). However, the enzymatic method does not appear to skew randomizations toward particular bases, does not introduce potential problems due to mismatched bases, and is almost certainly the best method to use in "biased" randomizations.
Figure 6 shows a variation on the enzymatic approach that allows the synthesis of larger mutagenic cassettes.15 In this case, two oligonucleotides are synthesized, each covering half the distance between the two restriction sites, with a complementary 9-base overlap at their 3' ends. Each oligonucleotide contains one of the two appropriate restriction sites plus several additional bases at its 5' end. The oligonucleotides are annealed and filled in, each priming second strand synthesis of the other.
Digestion with the two enzymes yields the final randomized cassette with the appropriate ends for cloning into the plasmid backbone. Any of the codons between the two sites may be randomized. This method has been used to randomize four codons combinatorially, the most distant of which were 100 bp apart in the sequence.17
Procedure. For enzymatic second strand synthesis, we use approximately 5 µg of the template oligonucleotide. Following annealing, extension is performed in 50 µl of a solution containing 10 mM Tris-HCl (pH 7.5), 10 mM MgCl2, 100 µg/ml bovine serum albumin, 50 mM NaCl, and 250 µM dNTPs using 10 units of the Klenow fragment of DNA polymerase I (Boehringer-Mannheim, Mannheim, Germany) or Sequenase (United States Biochemical, Cleveland, OH). The Sequenase enzyme has higher processivity but also exhibits a higher misincorporation rate (see below). After 1 hr at 37°, 5 µl of 2.5 mM dNTPs, 0.5 µl of 0.5 M dithiothreitol,
[Figure 6 diagram, three steps: (1) Anneal the two oligonucleotides, each carrying randomized codons (NNN), through the complementary regions C and C' at their 3' ends; (2) Extend with DNA polymerase; (3) Digest with restriction enzymes A and B.]
FIG. 6. A strategy for randomizing multiple codons distant in the sequence. Two oligonucleotides are synthesized that together encode the region between restriction sites A and B. The first oligonucleotide corresponds to the top strand, the second oligonucleotide corresponds to the bottom strand. The two oligonucleotides are synthesized with a complementary 9-base overlap at their 3' ends. Each molecule serves as a template for extension of the other by DNA polymerase. Digestion with restriction enzymes A and B yields the final, full-length cassette.
and 10 units of enzyme are added, and the reaction is allowed to proceed at 37° for an additional hour. The DNA is then ethanol precipitated, and approximately 10 µg of the double-stranded oligonucleotide is digested with 40 units of each restriction enzyme at 37° overnight. Extension and digestion reactions are monitored by running the DNA on 6% (w/v) denaturing polyacrylamide gels. The final extended and digested cassettes are purified on 8% nondenaturing gels.
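The two-oligonucleotide variant of Fig. 6 can be sketched as a small design helper. This is a sketch under stated assumptions, not the authors' design procedure: the sequence, the function names, and the overlap position are hypothetical, degenerate positions are written as N or S (G/C), and the extra bases needed beyond each restriction site for efficient digestion are omitted. The helper refuses an overlap that contains randomized bases, since the two oligonucleotides are synthesized independently and would not match there.

```python
# Sketch only: split a designed top strand into two mutually priming
# oligonucleotides that share a 9-base overlap at their 3' ends.
COMPLEMENT = str.maketrans("ACGTNS", "TGCANS")  # N and S pair with N and S


def revcomp(seq: str) -> str:
    """Reverse complement, keeping the degenerate symbols N and S."""
    return seq.translate(COMPLEMENT)[::-1]


def split_for_mutual_priming(top: str, overlap_start: int, overlap_len: int = 9):
    """Return (top_oligo, bottom_oligo), both written 5'->3'."""
    overlap = top[overlap_start:overlap_start + overlap_len]
    if len(overlap) != overlap_len:
        raise ValueError("overlap runs past the end of the sequence")
    if any(base not in "ACGT" for base in overlap):
        raise ValueError("randomized bases must lie outside the overlap")
    top_oligo = top[:overlap_start + overlap_len]
    bottom_oligo = revcomp(top[overlap_start:])
    return top_oligo, bottom_oligo


def filled_in_top(top_oligo: str, bottom_oligo: str, overlap_len: int = 9) -> str:
    """Top strand expected after each oligonucleotide primes the other."""
    return top_oligo + revcomp(bottom_oligo)[overlap_len:]


# Hypothetical cassette: site A, a randomized codon, overlap, a second
# randomized codon, site B.
design = "GAATTC" + "GCTNNSGCA" + "TTGACCGTA" + "CGTNNSAAA" + "GGATCC"
oligo_top, oligo_bottom = split_for_mutual_priming(design, overlap_start=15)
print(oligo_top)
print(oligo_bottom)
```

In this model the fill-in reaction regenerates the full designed top strand, and the last nine bases of each oligonucleotide are reverse complements of one another.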
Preparation of Plasmid Backbone
If plasmids containing the wild-type gene (or any active gene) are used to prepare a backbone fragment, then the main consideration is preventing significant contamination of the backbone fragment by uncut or singly cut plasmid molecules. Such contamination will give rise to an unwanted background of unrandomized clones. This can be an especially serious problem in multiple codon randomizations, where only a small fraction of the randomized sequences may be active. A simple strategy that allows contamination due to uncut or singly cut molecules to be detected is to change one of the nonrandomized codons in the cassette to a degenerate codon, thereby introducing a silent mutation. When candidates are sequenced, contaminants can be readily identified because they will have the wild-type DNA sequence. Contamination with uncut or singly cut molecules can be reduced by overdigestion with restriction enzymes or by careful gel purification of the backbone fragment, but these precautions are not always sufficient.
The best solution to this problem is to purify the backbone from a plasmid that contains a large "stuffer" fragment cloned between the restriction sites to be used for insertion of the mutagenic cassette (for an example of the use of "stuffer" fragments, see Ref. 27). The stuffer fragment serves two purposes. First, it disrupts the coding sequence, which should result in inactivation of the gene. This ensures that any functional genes recovered following random mutagenesis result from the insertion of oligonucleotide cassettes rather than from backbone reclosure. Second, if the stuffer fragment is fairly large, its excision following digestion with restriction endonucleases leads to a significant change in mobility during gel electrophoresis. Consequently, the plasmid backbone can be readily purified away from uncut or singly cut plasmid DNA.
Procedure.
Plasmid DNA that has been purified by CsCl gradient centrifugation is the most reliable for digestion with restriction enzymes, although DNA purified by minipreparation procedures32 is often of sufficient purity. In typical experiments, 5 µg of plasmid DNA is digested with 5-10 units of each restriction enzyme for 1 hr at 37°. An additional 5-10 units of each enzyme is added, and the incubation is continued for 1 hr. Restriction fragments are separated by electrophoresis on low melting point agarose gels. DNA fragments are isolated from gel slices by any of several methods, including the Qiagen Cartridge (Qiagen, Inc., Studio City, CA), Elutip (Schleicher & Schuell, Keene, NH), and Gene Clean (Bio 101, Inc., La Jolla, CA) protocols. Alternatively, the slice may be melted, phenol extracted, and the DNA purified by ethanol precipitation.32
Double-stranded oligonucleotide cassettes are ligated to plasmid backbone overnight at 4-14° in ligation buffer [50 mM Tris-HCl (pH 7.5), 10 mM MgCl2, 10 mM dithiothreitol (DTT), 1 mM spermidine, 1 mM ATP, 100 µg/ml bovine serum albumin]. Ligation reactions are performed at roughly equimolar concentrations of insert and plasmid backbone.
Transformation
Since the goal of a random mutagenesis experiment is usually to sample as much of sequence space as possible, the number of residue positions that can be examined in a single experiment rapidly becomes limited by the efficiency of each step in the procedure, and especially by the efficiency of transformation. We have found, using the transformation protocol of Hanahan,34 that it is possible to obtain 10^5 to 10^6 transformants from 100 ng of DNA. This is enough to ensure that most of sequence space is sampled in experiments in which one to three residue positions are randomized at a time. However, randomizing more than three positions generally requires scaling up the transformation procedure. An alternative is to use electroporation,32 which gives transformation efficiencies about 100-fold greater than Hanahan transformation.
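The relation between the number of transformants and the fraction of sequence space sampled can be estimated with a short calculation. The sketch below assumes that each randomized position is an NN(G/C) codon (32 DNA sequences per position), that all sequences are equally likely, and that every transformant carries an independent cassette; the function name and the Poisson approximation are illustrative choices, not part of the original protocol.

```python
import math


def expected_coverage(n_transformants: float, n_codons: int,
                      codons_per_position: int = 32) -> float:
    """Expected fraction of distinct DNA sequences seen at least once.

    Poisson approximation: a given sequence is missed with probability
    exp(-n_transformants / library_size).
    """
    library_size = codons_per_position ** n_codons
    return 1.0 - math.exp(-n_transformants / library_size)


for k in (1, 2, 3, 4):
    size = 32 ** k
    low = expected_coverage(1e5, k)   # 10^5 transformants
    high = expected_coverage(1e6, k)  # 10^6 transformants
    print(f"{k} codon(s): {size} sequences, "
          f"coverage {low:.3f} to {high:.3f}")
```

Under these assumptions, three randomized codons (32,768 sequences) are roughly 95% covered by 10^5 transformants, whereas four codons (about 10^6 sequences) are only about 60% covered even by 10^6 transformants, which is consistent with the need to scale up beyond three positions.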
Sequencing
Since random mutagenesis generates a large number of mutant genes, the ability to rapidly sequence these genes is essential. We have found it convenient to perform mutagenesis using plasmids bearing an M13 origin of replication to allow production of single-stranded plasmid DNA.35 This DNA is then sequenced using the dideoxy method.36 An alternative is to isolate and sequence the double-stranded plasmid DNA.37,38 We typically perform 48, 72, or 96 sets of dideoxy-sequencing reactions at a time. Sequencing this many candidates at once is facilitated by the use of microtiter dishes for all of the sequencing reactions. Sequencing reagents may be rapidly dispensed into the microtiter wells using a repeating pipette. The labeling and extension reaction is performed in one row of the dish, and then aliquots of each reaction are transferred using a multichannel pipette to separate wells for the A, C, G, and T termination reactions. Sequencing kits are available (e.g., from Amersham, Arlington Heights, IL) with reagents already dispensed into microtiter wells. The use of double-fine sharkstooth combs on 34-cm wide gels allows 24 sets of sequencing reactions to be loaded per sequencing gel.
34 D. Hanahan, J. Mol. Biol. 166, 557 (1983).
35 R. J. Zagursky and M. L. Berman, Gene 27, 183 (1984).
36 F. Sanger, S. Nicklen, and A. R. Coulson, Proc. Natl. Acad. Sci. U.S.A. 74, 5463 (1977).
37 E. Y. Chen and P. H. Seeburg, DNA 4, 165 (1985).
38 D. Seto, Nucleic Acids Res. 18, 5905 (1990).
Procedure. To prepare single-stranded DNA for sequencing, a 1.5-ml aliquot of 2x YT39 containing 1.5 × 10^7 M13 RV-1 helper phage40 is inoculated with 30 µl of a fresh overnight culture. The cells are grown aerobically in a roller drum and infection is allowed to proceed for 6 hr at 37°. The cultures are then transferred to 1.5-ml Eppendorf tubes and centrifuged for 10 min. The full 10-min centrifugation is necessary to pellet cell debris that can otherwise interfere with the DNA sequencing. A 1.2-ml portion of the supernatant, containing the phage, is transferred to an Eppendorf tube containing 300 µl of 2.5 M NaCl in 20% polyethylene glycol (Mr 8000). After 30 min at room temperature, this solution is centrifuged for 10 min at 4° to pellet the phage. The supernatant is poured off, the tubes are briefly centrifuged again, and the residual supernatant is removed by aspiration. It is crucial that all the polyethylene glycol be removed, since it will inhibit the sequencing reactions. The phage pellet is suspended in 100 µl TES [20 mM Tris-HCl (pH 7.5), 10 mM NaCl, 0.1 mM EDTA], a 50-µl aliquot of phenol is added, and the tubes are vortexed for 30 sec. After 5 min at room temperature, the tubes are again vortexed for 30 sec, and then centrifuged for 10 min at 4°. An 80-µl portion of the upper, aqueous supernatant is removed and added to 4 µl of 3 M sodium acetate. The single-stranded DNA is precipitated by addition of 200 µl ethanol. The DNA is pelleted by centrifugation for 15 min at 4° and rinsed with 200 µl 70% ethanol. Following another 5-min centrifugation, the DNA is dried in a Savant (Farmingdale, NY) Speed-Vac and dissolved in 25 µl TES. Eight microliters of this DNA is used for sequencing according to the Sequenase (United States Biochemical) protocol. Using this amount of DNA, 200-300 bases can be sequenced.
However, when sequencing very close to the sequencing primer (<50 bases), better results are obtained if the purified single-stranded DNA is dissolved in 12 µl of TES, and an 8-µl aliquot is used for sequencing.
To prepare double-stranded plasmid DNA for sequencing, 20 µl of a fresh overnight culture is added to 2 ml of L broth (LB),39 and the cells are grown at 37° for 5 hr (it is important to use a freshly saturated culture). A 1.5-ml portion of culture is transferred to a 1.5-ml Eppendorf tube, and the cells are pelleted by centrifugation for 30 sec. The cell pellet is suspended in 300 µl STET [8% sucrose, 0.5% (v/v) Triton X-100, 50 mM EDTA, 50 mM Tris-HCl (pH 8)], and a 20-µl aliquot of 10 mg/ml lysozyme is added. After 5 min at room temperature, the solution is heated in a boiling water bath for 2 min and then centrifuged for 5 min at room temperature. The pellet is removed with a toothpick, and 200 µl of 2.5 M ammonium acetate/75% 2-propanol is added. After 5 min at room temperature, the solution is centrifuged for 5 min at room temperature, and the pellet is washed with 200 µl 70% ethanol. Following another 5-min centrifugation, the pellet is dried in a Savant Speed-Vac, and then resuspended in 20 µl TE [10 mM Tris-HCl (pH 8), 1 mM EDTA].
To sequence using the double-stranded DNA as a template, the entire sample prepared as described above is used. To the DNA are added 4 µl of 16 ng/µl sequencing primer and 6 µl of 2 M NaOH. The solution is incubated at 37° for 15 min. The DNA is precipitated by addition of 10 µl of 3 M sodium acetate (pH 5.2) and 200 µl of cold ethanol, and is pelleted by centrifugation for 15 min at 4°. The pellet is washed with 200 µl of 70% ethanol. Following a 5-min centrifugation, the supernatant is removed and the pellet is dried in a Savant Speed-Vac. The dried pellet is resuspended in 12 µl of sequencing buffer [prepared by mixing 2 µl of 5x Sequenase buffer (United States Biochemical), 1 µl of 0.1 M DTT, and 9 µl of water] and the sequencing is continued according to the Sequenase protocol, beginning with the labeling step.
39 J. Miller, "Experiments in Molecular Genetics." Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, 1972.
40 A. Levinson, D. Silver, and B. Seed, J. Mol. Appl. Genet. 2, 507 (1984).
Methodological Considerations
Sources of Nonrandomness. In using random mutagenesis techniques to study protein structure and function, it is important that the distribution of amino acids at the mutagenized positions be as near to truly random as possible. The nature of the genetic code dictates that the distribution cannot be perfectly random, since different amino acids are encoded by different numbers of codons. However, at the nucleotide level, a random distribution should be possible. In practice, there are several potential sources of nonrandomness to consider. Some of these sources are difficult or impossible to control experimentally. For example, the stability or translation of an mRNA may be sensitive to the presence of particular codons at some positions, causing certain protein sequences to be underrepresented among the functional sequences. Deviations from a random distribution of bases may also arise in the construction of the cassette. If the coupling efficiency during oligonucleotide synthesis is not the same for each of the four bases, there will be an unequal distribution of bases in the final cassette. Bias may also be introduced in the annealing of the two oligonucleotide strands. For example, in the inosine pairing method, randomized sequences rich in C may pair better than other sequences
[Figure 7 bar chart: observed frequency of each base (A, C, G, T; axis ticks 0 to 30) for three strategies: random bases on both strands (n = 184), inosine pairing (n = 430), and enzymatic second strand synthesis (n = 525).]
FIG. 7. Base frequencies observed in randomization experiments. The frequency with which each base was observed on the randomized strand is shown, using three different strategies for construction of the complementary strand. In each case, sequences from unselected populations were used, and only data from fully randomized positions were included (i.e., the third position in NN(G/C) randomizations was omitted). The total number of base sequences determined for each strategy is indicated.
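The point that the genetic code prevents a perfectly even amino acid distribution can be checked by enumerating the codons of an NN(G/C) randomization. The sketch below uses the standard genetic code; the variable and function names are illustrative, and it assumes that all four bases are equally represented at each N position and that G and C are equally represented at the third position.

```python
from collections import Counter
from itertools import product

# Standard genetic code, with codons ordered over the bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def nns_codons():
    """All codons with any base at positions 1 and 2 and G or C at position 3."""
    return [a + b + c for a, b in product("ACGT", repeat=2) for c in "GC"]


# Number of NN(G/C) codons encoding each amino acid ("*" marks a stop codon).
counts = Counter(CODON_TABLE[codon] for codon in nns_codons())
for residue, n in sorted(counts.items(), key=lambda item: (-item[1], item[0])):
    print(residue, n)
```

The 32 codons encode all 20 amino acids plus a single stop codon (TAG), but with unequal weights: leucine, serine, and arginine are each encoded by three codons, whereas residues such as methionine and tryptophan are encoded by one.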
to the inosine-containing strand.33 As a result, such sequences may be overrepresented in the mutagenized pool.
Evaluations of Base Distributions. To estimate the nucleotide distribution within pools of mutagenized plasmids generated by different methods, we sequenced a number of randomized genes from transformants that were not subjected to a functional selection. Figure 7 shows the frequency of each nucleotide observed at randomized positions, mutated using (1) random bases on both strands of the cassette, (2) the inosine pairing method, or (3) enzymatic second strand synthesis. This figure shows data from fully randomized bases only; that is, the third base in NN(G/C) randomizations is omitted. All three methods give fairly even distributions of the four bases, although the inosine-pairing method overrepresents C and underrepresents A. The bias toward C in the inosine method probably reflects a slight base-pairing preference.33 If the 5' ends of the oligonucleotide cassette are not phosphorylated prior to ligating into the plasmid backbone (thereby leaving a nick), then this bias toward C can be extreme at positions near the end of the cassette. For example, in one such experiment, C was recovered 100% of the time at the terminal codon, 95% of the time at the penultimate codon, and approximately 70% of the time at
the next two codons. This may result from exonuclease digestion of the randomized bases near the nick, followed by preferential insertion of C opposite the inosines in the gapped duplex during repair synthesis in vivo.
Bonus Mutations. Another problem that may be encountered during construction of the mutagenic cassette is the incorporation of the wrong base at nonrandomized positions. When randomizing positions by including NN(G/C) on both strands or by using the inosine-pairing method, "bonus" mutations are observed at a frequency of about 0.03%/base throughout the cassette region. As a consequence, for cassettes of 30-60 bp, roughly 1-2% of the sequences are found to contain mutations at nonrandomized positions. These mutations may arise from misincorporation of nucleotide bases during synthesis of the oligonucleotides on the DNA synthesizer. Alternatively, they could reflect chemical modification of bases or failure to deprotect bases completely, with subsequent errors arising during replication in the cell. Another problem is the occurrence of single-base deletions in the cassette region. Such deletions appear in roughly 10% of the sequences from unselected populations. A somewhat higher frequency of bonus mutations is observed when the second strand of the cassette is synthesized enzymatically. In one set of experiments in which the Sequenase enzyme was used to extend the oligonucleotides as shown in Fig. 5C, 30% of the candidates analyzed contained additional mutations (including deletions) at nonrandomized positions. Since the cassettes in these experiments were roughly 100 bp in length, this frequency represents a 0.3% misincorporation rate at each base. Many of these errors probably arise during the extension reaction, both as a consequence of the lack of an editing function in the Sequenase enzyme and because the extension reaction is usually performed at high nucleotide concentrations.
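The arithmetic linking a per-base error rate to the fraction of clones carrying a bonus mutation can be written out explicitly. The sketch below assumes that errors occur independently at each base with the same probability; the function name is illustrative, and the rates and lengths are those quoted in the text.

```python
def fraction_with_bonus_mutation(per_base_rate: float, cassette_length: int) -> float:
    """Probability that a cassette carries at least one unintended change,
    assuming independent errors at each base."""
    return 1.0 - (1.0 - per_base_rate) ** cassette_length


# Chemical second strand: about 0.03% per base, cassettes of 30-60 bp.
for length in (30, 60):
    share = fraction_with_bonus_mutation(0.0003, length)
    print(f"0.03%/base, {length} bp: {share:.2%}")

# Enzymatic second strand with Sequenase: about 0.3% per base, ~100 bp.
share = fraction_with_bonus_mutation(0.003, 100)
print(f"0.3%/base, 100 bp: {share:.2%}")
```

For low error rates the result is close to the simple product of rate and length, which is how the figures in the text are related: 0.03% per base gives roughly 1-2% of 30-60 bp cassettes, and 0.3% per base over 100 bp gives a value somewhat under the linear estimate of 30%.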
Although the Sequenase enzyme is more error prone than the Klenow fragment of DNA polymerase I, it can be used to extend some oligonucleotides that cannot be extended with the Klenow fragment. Hence, there is a trade-off of efficiency versus fidelity when choosing between these two enzymes. Native T7 polymerase has many desirable properties that may help to overcome some of these problems,41 although we have not had extensive experience with this enzyme.
Heterogeneity. In cassette randomization experiments, there is always heterogeneity among the plasmid molecules, prior to transformation into recipient cells. Single plasmids will encode different sequences as a consequence of the deliberate introduction of random bases during construction of the plasmids, and some or even all of the plasmids may bear mismatches as a consequence of the method of second strand synthesis of the mutagenic cassette. Heterogeneity may persist after transformation if a single cell is transformed with plasmids bearing different sequences or if DNA replication of mismatched strands yields more than one sequence. Fortunately, such examples are rare in our experience. Presumably, multiple transformation events are uncommon and repair in vivo corrects most mismatches before replication can occur. As a consequence, it is possible to use primary transformants for selections, screens, and sequencing in most cases. Nevertheless, the possibility of mixed populations in cells must be kept in mind. It is good practice to restreak candidates to single colonies before single-stranded template DNA is prepared for sequencing, and to retest activity phenotypes after colony purification. Heterogeneity may be indicated as a potential problem if phenotypes are not stable, if independent isolates of the same sequence appear to confer different phenotypes, or if sequencing suggests the presence of more than one base at a given position. In these cases, plasmid DNA should be purified and used to retransform cells before further experiments.
41 K. Bebenek and T. A. Kunkel, Nucleic Acids Res. 17, 5408 (1989).
[28] Linker Insertion Mutagenesis as Probe of Structure-Function Relationships
By STEPHEN P. GOFF and VINAYAKA R. PRASAD
The in vitro modification of cloned genes is one of the most powerful methods available for the localization of functional domains of a given gene product. The analysis of the function of mutant gene products, and correlation of the position of mutations with their effects, can quickly permit a determination of the essential regions encoding that function. A variety of mutagenesis techniques, variously generating substitutions, insertions, and deletions, can be used to reveal the structural organization of domains in proteins. Amino acid substitution, readily achieved with oligonucleotides, is perhaps the most commonly used form of mutagenesis. When the structure of a gene is only poorly understood, however, it is difficult and expensive to use this method to make a large library of mutants with changes scattered across the gene. In contrast, linker insertion mutagenesis offers a rapid means of structure-function analysis in the absence of clues suggesting specific site-directed mutations. In linker insertion mutagenesis, short palindromic oligonucleotides containing the recognition sequence for a particular restriction enzyme are inserted at known locations throughout the length of the gene, and the effects of these insertions on protein function and stability are studied.
METHODS IN ENZYMOLOGY, VOL. 208
Copyright © 1991 by Academic Press, Inc. All rights of reproduction in any form reserved.