REFLECTIONS
TIBS 24 – MAY 1999
Protein sequencing and the making of molecular genetics

In the early history of molecular genetics, proteins – not nucleic acids – were at the centre of research. This was the case long after Watson and Crick had proposed their double-helical model of DNA. The reason was simple: proteins could be sequenced; nucleic acids could not. Here, I present a brief history of protein sequencing, charting its development from a tool in protein chemistry to its uses in the study of gene function and the way it informed initial attempts at nucleic acid sequencing. I focus on Sanger’s sequencing work in the Biochemistry Department in Cambridge (UK) and the use of his techniques by molecular geneticists in the nearby Physics Department. Sanger’s work won him a double Nobel Prize, and his close interaction with molecular biologists culminated in common plans for a new Laboratory of Molecular Biology on the outskirts of Cambridge.
A new tool

Protein sequencing started as part of studies of protein structure and function. From the late 19th century onwards, proteins were associated with the basic functions of life, including heredity, and were the object of intense study and debate. Knowledge of protein structure was seen as the key to studies of protein function and as a step towards the synthetic production of proteins. Various theories regarding protein structure were proposed, and discarded, on the basis of sedimentation studies, amino acid analysis and X-ray studies. They invariably assumed that proteins exhibited a high degree of regularity.

Sanger’s sequencing work developed from research into a new method for end-group determination in proteins, which he undertook as a postdoctoral student under Charles Chibnall in the Biochemistry Department in Cambridge during the war. End-group determination was an important tool for the estimation of the number and length of polypeptide chains in proteins, yielding basic information on protein structure. It could be used to identify proteins as well as to test their purity. Many different methods for end-group determination were described in the literature –
a review article published in 1945 listed more than 20 (Ref. 1) – but none yielded reliable results. Acting on Chibnall’s suggestion, Sanger tried fluorodinitrobenzene as a reagent. Organic chemists had tended to steer clear of fluoro derivatives because of the latter’s toxicity, but during the war the chemicals were synthesized for research into chemical warfare. Sanger found that fluorodinitrobenzene reacted under much milder conditions than did the generally used chlorine compounds. Furthermore, the dinitrophenylamino acids were stable to the acid hydrolysis used to break down proteins and were bright yellow in solution. This made them amenable to the new method of partition chromatography. Using his new technique, Sanger established that insulin consisted of two chains – not 18 chains, as Chibnall had postulated on the basis of its high free-amino-group content. Exploiting the same technique, Sanger identified short, 4–5-residue, sequences at the N-termini of the two chains of the molecule2. Extending the approach to peptides derived by partial acid hydrolysis, and later by enzymatic hydrolysis, Sanger and his co-workers3,4, in several years of painstaking work, were able to establish the complete sequence of the two insulin chains.

Sanger has suggested that end-group determination had already marked a change in the course of protein chemistry from an interest in amino acid analysis to one in the arrangement of amino acids in chains. Moving from there to sequencing did not require a great intellectual leap5. The decisive breakthrough, according to Sanger, was the development of new fractionation techniques by Richard Synge and Archer Martin in the context of research into the composition and characteristics of wool – research that was financed by the International Wool Secretariat6.
Sanger’s thesis finds confirmation in the fact that Synge and Martin themselves successfully applied their new fractionation techniques to determination of the structure of the pentapeptide gramicidin S. Their results were published before Sanger presented his first bit of sequence7.
0968 – 0004/99/$ – See front matter © 1999, Elsevier Science. All rights reserved.
Other researchers were active in the field. Pehr Edman8 at the University of Lund in Sweden developed an elegant procedure that was based on the use of phenylisothiocyanate as a reagent and that allowed stepwise degradation of the protein. With the development of more reliable fraction collectors and of sensitive methods for detection of the colourless reaction products, Edman’s method completely superseded Sanger’s sequencing method. In the late 1950s, William Stein and Stanford Moore at the Rockefeller Institute in New York devised an automatic amino acid analyser that yielded quantitative results and facilitated the analytical work. Edman’s procedure, in conjunction with the automatic analyser, allowed researchers to tackle larger proteins. Sanger was interested in sequencing because he felt that it would provide insight into the mechanism of action of insulin. The expectation was that, once the mechanism of action of one protein was known, it would give clues to the functioning of protein hormones and enzymes more generally. When the full sequence of insulin, including the position of the three disulphide bridges, did not give any clue to the protein’s function, Sanger explored new ways of using sequencing to achieve his aim. One avenue he pursued was to identify the ‘active centre’ of insulin by determining and comparing the sequences of homologous proteins from different species. Another approach he followed was to label the active centre and to determine the sequence around it. In the course of their work, Sanger and his collaborators developed sensitive autoradiographic methods and ‘fingerprinting’ techniques that allowed them to deduce the sequences of peptides without carrying out a complete amino acid analysis. Insulin’s mechanism of action proved resilient to all these attacks, but the new methods became important analytical tools for structure determination (Fig. 1). 
In a review article published in the late 1960s, Brian Hartley, who worked in Sanger’s group, documented the exponential growth in both the speed and volume of sequencing work9. Despite the problems encountered with insulin, researchers expected that protein sequencing would yield insight into the structure and function of proteins. In combination with X-ray analysis, protein sequencing led to the first atomic model of a globular protein (myoglobin) and, some years later, to the first atomic model of an enzyme (lysozyme)10,11; a
PII: S0968-0004(99)01360-2
reaction mechanism for lysozyme was soon proposed. Sequencing also gave rise to evolutionary studies of proteins as a completely new area of research. Finally, protein sequencing was keenly seized upon by researchers concerned with the molecular mechanism of gene function.

Figure 1 Fred Sanger, holding an autoradiogramme, photographed in his laboratory in the Biochemistry Department in Cambridge in the late 1950s. (Photograph courtesy of F. Sanger.)

The sequence hypothesis

Sanger’s first sequencing results suggested to Crick – even before 1953 – that genes determined the amino acid sequence of proteins (Ref. 12, pp. 34–36). Later, Crick expanded this insight in the sequence hypothesis, which stated that ‘the specificity of a piece of nucleic acid is expressed solely by the sequence of its bases, and that this sequence is a (simple) code for the amino acid sequence of a particular protein’13.

The sequence hypothesis was a decisive step in early speculation on the genetic code. It boldly assumed that the amino acid sequence determined the folding of a protein. A few years before, Linus Pauling14 had postulated the existence of a gene ‘responsible for the folding of polypeptide chains’ to explain the different electrochemical charges of sickle-cell and normal haemoglobin, which appeared to have the same amino acid composition. In 1957, Crick still considered some possible exceptions to the rule, especially γ-globulins and adaptive enzymes. Heuristic reasons, however, made the sequence hypothesis attractive. In his celebrated lecture on protein synthesis, which he delivered before the Society for Experimental Biology, Crick conceded frankly: ‘Our basic handicap at the moment is that we have no easy and precise technique with which to study how proteins are folded, whereas we can at least make some experimental approach to amino acid sequences. For this reason, if for no other, I shall ignore folding in what follows and concentrate on the determination of sequences’13. Thus, Sanger’s analysis not only allowed the formulation of the hypothesis but also provided the experimental tool to test it.

It is worth noticing that, in the same lecture in which Crick for the first time explicitly stated the ‘central dogma’ of molecular biology and defended an ‘informational’ versus a biochemical view of the problem of protein synthesis, he stressed the central and unique importance of proteins in biology. Crick expected that, in contrast to the multiple and complex functions of proteins, nucleic acids acted in a ‘uniform and rather simple’ way13.

Before testing the sequence hypothesis, it was necessary to show that an inherited defect was in fact laid down in the amino acid sequence of a protein. As is well known, Vernon Ingram’s experiments on sickle-cell haemoglobin, which were performed in the Cavendish laboratory, provided such proof. Refining Sanger’s fingerprinting techniques, Ingram succeeded in tracking down the difference between normal and sickle-cell haemoglobin to a single amino acid residue (Fig. 2)15.

Building on this first success, Crick, Brenner and Ingram, together with Seymour Benzer and George Streisinger, who gathered in Cambridge in 1957, tried to show that the order of mutations in a gene lined up with the order of changes in the amino acid sequence of the corresponding protein. The original
plan was to use Benzer’s finely mapped mutants of the rII region of bacteriophage T4 as a test case. When Benzer failed to isolate the corresponding protein, the group resorted to Streisinger’s bacteriophage-T2 mutants, in which the tips of the tail fibers were affected. The key technique was again fingerprinting, which the group combined with the radioactive marking technique Sanger had pioneered. Radiographic techniques were much more sensitive than other chromatographic techniques. Sanger himself was experimenting with slices of oviducts, which he incubated with radioactive phosphate to get labelled ovalbumin. This was quite a lengthy and laborious procedure. As early as October 1956, Crick wrote to Brenner, who was still in Johannesburg: ‘I stressed to Fred [Sanger] how extremely favorable the phage system might be for this method…He seemed very interested’16.

The experiments on the tail-fiber mutants did not yield conclusive results and were abandoned. Apparently, the group had not succeeded in isolating the right protein. However, Brenner continued to use the same techniques in work on the amber mutants of the phage T4. These mutants, which grew only on the Escherichia coli B strain, produced only fragments of the head protein – the protein that the affected gene encoded. By examining the fingerprints of the different mutants, Brenner and his collaborators were able to establish that the length of a fragment corresponded to the position of the mutation on the genetic map, thus proving collinearity. A few months earlier, Charles Yanofsky and his collaborators at Stanford University, using similar techniques, had proved the same point in studies on tryptophane synthetase mutants of E. coli17,18.

Besides proving collinearity, sequencing data were also used to establish some general features of the genetic code.
On the basis of the few sequences then available, Crick disproved all possible versions of the first genetic code proposed by George Gamow (Ref. 12, p. 94). From published data on protein sequences and neighbor analysis, Brenner later deduced that an overlapping code was impossible19. A further handle on the problem of the genetic code came from analysis of the effects of chemical mutagens in combination with genetic and protein sequence analysis20,21. The code itself was established by entirely different in vitro translation techniques, but the early experiments in
which protein sequencing played a central role were nonetheless crucial for the formulation of the problem. Protein sequencing also remained an important tool for checking the validity of the code. Amino acid substitutions in haemoglobin variants established by fingerprinting, for instance, proved that the genetic code that had been established for bacteria and viruses was also valid for humans22.

Interestingly, Crick and Brenner not only used Sanger’s sequencing work as a conceptual and practical tool for their own work in the newly defined field of molecular genetics, but also actively tried to interest Sanger in their work. They first approached him in the early 1950s, trying to convince him to move from the Biochemistry Department to the Cavendish Laboratory. Nothing came of this plan at the time. However, in 1957, the Cavendish group and Sanger joined forces, and together negotiated the creation of a new Laboratory of Molecular Biology23. To my knowledge, it was the first institution to carry that name. The combination of (both two- and three-dimensional) structural and genetic approaches became central to the definition of molecular biology at Cambridge. The creation of the new laboratory also had repercussions on the research agendas of those involved in the new venture.
Figure 2 Fingerprints of normal and sickle-cell haemoglobin. Note the difference in peptide 4. Figure reproduced, with permission, from Ref. 15.

From protein to nucleic acid sequencing

When trying to account for his ‘conversion’ from protein to nucleic acid sequencing, Sanger referred to ‘the atmosphere’ in the Laboratory of Molecular Biology and to the influence of his new colleagues. ‘With people like Francis Crick around,’ he reckoned, ‘it was difficult to ignore nucleic acids or to fail to realize the importance of sequencing them’23. Originally, the main objective of nucleic acid sequencing was to try to ‘break the genetic code’24. However, nucleic acid sequencing got going seriously only after the code was broken.

Initially, nucleic acid sequencing seemed an even more daunting undertaking than protein sequencing had been. This was due to the lack of pure small substrates and to the composition of nucleic acids. Because nucleic acids possessed only four monomers, researchers expected the interpretation of results to be much more difficult. This expectation was based on the existing approaches to studying protein sequences, which required analysis of degradation products and the subsequent rearrangement of the pieces. New developments in sequencing techniques reversed the picture.

The first nucleic acid to be sequenced was alanine tRNA, the first small RNA to be isolated. The methods used were similar to those established for protein sequencing: enzymatic degradation followed by fractionation, analysis and interpretation of the degradation products25. These methods were too laborious to be applied to larger RNA or DNA molecules; however, the procedures for more rapid and reliable sequencing subsequently developed by Sanger and others continued to rely on methods pioneered with proteins or on information
derived from protein sequencing. This is especially true of autoradiography and labelling techniques, which allowed one to ‘read off’ the sequence from the autoradiogramme directly and therefore did not require complex interpretative procedures. This latter method became much more powerful in nucleic acid sequencing than it ever had been with proteins. A key development in nucleic acid sequencing was the introduction of copying techniques (instead of sequencing by degradation). But, again, the first primers employed to get the polymerases started were synthesized by using information provided by amino acid sequencing. Protein sequencing
also served as an important check for the still unreliable DNA-sequencing data26 (see also recent articles on DNA sequencing27,28).

Conclusions

In the debate on the role of biochemists in the history of molecular biology, which has been conducted heatedly since the late 1960s29–31, Sanger represents an interesting case. Despite joining the Laboratory of Molecular Biology in Cambridge, he never gave up his identity as a biochemist – or, more precisely, he never saw the necessity to draw a distinction between the two fields. Interestingly, too, protein sequencing is never mentioned among the techniques that biochemists introduced into molecular biology. My intention here, however, is not to fuel an old debate. The aim of my brief historical excursion is rather to show the important role of protein sequencing in the early history of molecular genetics. Today, ever faster and cheaper nucleic acid sequencing methods have overshadowed more-cumbersome methods of protein sequencing. However, this is only a fairly recent development. Long before nucleic acid sequencing techniques were at all conceivable, protein sequencing was at the forefront of research and offered a powerful tool for formulating and testing hypotheses about the functions of genes.

Acknowledgement

I thank Sydney Brenner, Francis Crick, John Kendrew and Fred Sanger for extensive discussions, and Denis Thieffry for constructive comments on an earlier version of this paper.

SORAYA DE CHADAREVIAN
Dept of History and Philosophy of Science, University of Cambridge, Free School Lane, Cambridge, UK CB2 3RH.

References

1 Fox, S. W. (1945) Adv. Protein Chem. 2, 155–177
2 Sanger, F. (1949) Biochem. J. 45, 563–574
3 Sanger, F. and Tuppy, H. (1951) Biochem. J. 49, 481–490
4 Sanger, F. and Thompson, E. O. P. (1953) Biochem. J. 53, 353–374
5 Sanger, F. (1985) Curr. Contents 28, 23
6 Dowling, L. M. and Sparrow, L. G. (1991) Trends Biochem. Sci. 16, 115–119
7 Consden, R., Gordon, A. H., Martin, A. J. P. and Synge, R. L. M. (1947) Biochem. J. 41, 596–602
8 Edman, P. (1950) Acta Chem. Scand. 4, 283–293
9 Hartley, B. S. (1970) in British Biochemistry Past and Present. Biochemical Society Symposium No. 30 (Goodwin, T. W., ed.), pp. 29–41, Academic Press
10 Kendrew, J. C. et al. (1960) Nature 185, 422–427
11 Blake, C. C. F. et al. (1965) Nature 206, 757–761
12 Crick, F. (1990) in What Mad Pursuit: A Personal View of Scientific Discovery, Penguin
13 Crick, F. (1958) in The Biological Replication of Macromolecules. Symposia of the Society of Experimental Biology XII, pp. 138–163, Cambridge University Press
14 Pauling, L. (1952) Proc. Am. Philos. Soc. 96, 556–565
15 Ingram, V. M. (1958) Biochim. Biophys. Acta 28, 539–545
16 Judson, H. (1979) Eighth Day of Creation. The Makers of the Revolution in Biology, p. 331, Jonathan Cape
17 Sarabhai, A. S., Stretton, A. O. W., Brenner, S. and Bolle, A. (1964) Nature 201, 13–17
18 Yanofsky, C. et al. (1964) Proc. Natl. Acad. Sci. U. S. A. 51, 266–272
19 Brenner, S. (1957) Proc. Natl. Acad. Sci. U. S. A. 43, 687–694
20 Crick, F. H. C., Barnett, L., Brenner, S. and Watts-Tobin, R. (1961) Nature 192, 1227–1232
21 Kay, L. Who Wrote the Book of Life? A History of the Genetic Code, Stanford University Press (in press)
22 Beale, D. and Lehmann, H. (1965) Nature 207, 259–262
23 de Chadarevian, S. (1996) J. Hist. Biol. 29, 361–386
24 Sanger, F. (1988) Annu. Rev. Biochem. 57, 1–28
25 Holley, R. W. et al. (1965) Science 147, 1462–1465
26 Sanger, F. (1988) Annu. Rev. Biochem. 57, 1–28
27 Wu, R. (1994) Trends Biochem. Sci. 19, 429–433
28 Sutcliffe, J. G. (1995) Trends Biochem. Sci. 20, 87–90
29 Cohen, S. C. (1984) Trends Biochem. Sci. 9, 334–336
30 Abir-Am, P. G. (1992) Osiris 7, 210–237
31 The Tools of the Discipline: Biochemists and Molecular Biologists (1996) [special issue] J. Hist. Biol. 29, 327–462

BOOK REVIEW

It’s all in the title…

Oncogenes and Tumour Suppressors (Frontiers in Molecular Biology, No. 19) edited by Gordon Peters and Karen H. Vousden, Oxford Science Publications, 1997. £29.95 (xix + 328 pages) ISBN 0 19 963594 3

In his landmark papers of 1910 and 1911, Peyton Rous described a spontaneously arising fibrosarcoma in a Plymouth Rock hen. The characterization of this tumour, which proved to be transplantable and transmissible as a cell-free filtrate, ultimately led to the discovery of acutely transforming retroviruses. The significance of Rous’s work, however, was recognized only some fifty years later: he was awarded the Nobel Prize in 1966. In fact, it was not until 1980, a decade after Rous’s death, that the transforming gene of the avian-sarcoma virus that bears his name was finally sequenced and shown to have a normal cellular counterpart. This definitive proof that non-transformed cells harbour genes that have the potential to become oncogenic (proto-oncogenes) earnt Harold Varmus and J. Michael Bishop the Nobel Prize in 1989 and ushered in the era of modern molecular oncology.

The concept of tumour-suppressor genes arguably could be accredited to the perspicacity of Boveri, who suggested, in 1914, that normal cells possess ‘definite chromosomes which inhibit division’ and that their elimination would result in unlimited growth in tumour cells. Much later, in 1971, Alfred Knudson’s ‘two-hit’ hypothesis explained the incomplete penetrance of inherited cancers and accurately anticipated the molecular lesions that would be discovered in the retinoblastoma gene (RB1) and other growth-inhibitory genes or anti-oncogenes.

This is the historical backdrop for Oncogenes and Tumour Suppressors. The editors, given the brief of summarizing the major developments in oncogenesis that have occurred over the past two decades, have assembled a strong cast of authors to produce a slim volume comprising eleven chapters. A search of the literature highlights the enormity of their task. There are >61 000 references in Medline dealing with oncogenes or tumour-suppressor genes and, if one includes papers that deal with the cell cycle, the number rises to ≫200 000! To make matters worse, over one third of the references are dated 1996 or later. Given that most of the chapters in this book were written in 1996, the book could have been considered well out of date by the time it hit the shelves! So, was this project a futile exercise doomed to failure from the start, or a highly commendable attempt to grapple with odds that would have made Hercules shirk?

Before passing judgement, let us examine how the editors set about their task. The book comprises two sections: the first devoted to oncogenes, which are considered as part of a signalling cascade; and the second to tumour-suppressor genes and their intimate relationship with cell-cycle control. Part I commences with an introductory chapter on mechanisms of oncogene perturbation that touches on viral oncogenesis, chromosomal