Analysis of gene expression by tissue and developmental stage
Chris Fields
The Institute for Genomic Research, Gaithersburg, USA
High-throughput sequencing of cDNAs from multiple tissue- and stage-specific libraries is an efficient method for characterizing gene expression by tissue and developmental stage. When combined with functional information derived from the systematic study of transcription factors, signal transducers, and other regulatory molecules in model systems, data from expressed sequence tag projects provide an increasingly detailed picture of gene expression and its regulation. Understanding this picture will require the development of highly sophisticated databases to organize and correlate these data.
Current Opinion in Biotechnology 1994, 5:595-598
Introduction
The analysis of gene expression is entering a new period of rapid advancement enabled by the development of efficient high-throughput technologies. Automated sequencing of cDNAs to generate expressed sequence tags (ESTs) [1,2] allows the identification of genes and the characterization of their expression simultaneously. The EST approach to expression analysis complements more traditional approaches, such as two-dimensional protein gel electrophoresis, in situ hybridization of mRNAs, or comparative northern blot analysis, and provides both expression data and sequenced clones as results.
Concurrent with the development and application of EST analysis has been steady progress in the characterization of regulators of gene expression, primarily new transcription factors. Much of this work takes advantage of non-mammalian model systems, in which both forward and reverse genetics are well defined (e.g. Saccharomyces, Caenorhabditis, and Drosophila), to characterize modes of action and mutant phenotypes of genes with mammalian homologs.
Rapid gene discovery and the functional characterization of regulatory processes support each other in a particularly fruitful way. Newly identified genes that appear to encode regulators, regardless of phylogenetic source, become immediate candidates for functional characterization via the analysis of a homolog in an experimentally tractable system. Characterization in a model organism, in turn, provides more precise functional hypotheses to test in the original source organism. As these activities further converge, the gene expression community will be faced with the challenge of integrating a wealth of information into a coherent picture of the mechanisms of gene expression and their instantiation across the phylogenetic spectrum. Understanding these data will require a new generation of biological databases and analysis tools.
This review summarizes the current state of EST analyses of gene expression in humans and other organisms, and overviews recent work on the characterization of transcription factors and the mechanisms of transcriptional regulation. It also provides pointers to databases that supply gene expression data mainly derived from EST studies.
Mammalian gene studies
Abbreviation: EST, expressed sequence tag.
© Current Biology Ltd ISSN 0958-1669
cDNA-based methods for gene expression analysis
Since its introduction in 1991, the EST approach has been used to characterize expressed genes in a variety of tissues and developmental stages of humans [1,3-12], mice [13,14], and wallabies [15], as well as the nematode Caenorhabditis [16,17], various plants [18-21] and prokaryotes [22]. Some groups have focused on 5' end sequencing for gene identification (e.g. [3,7,8]), whereas others have focused on sequencing from the 3' end, which enables unambiguous identification of the same transcript (up to regions subject to alternative splicing) in multiple libraries (e.g. [4,11]). Large numbers of ESTs have been mapped back to chromosomes in humans [1,23,24], Caenorhabditis [16,17], rice [18], and maize [20]. Some additional EST sequences that have not been formally published can be found in the dbEST database [25].
By far the largest EST project to date, in which a total of 174 472 ESTs were sequenced from 200 distinct libraries representing 30 human tissues, has recently been completed (MD Adams et al., unpublished data). This massive sequencing effort has identified up to 40 000 previously unknown human genes, probably more than half of the protein-coding genes in the human genome [26]. Over 10 000 of these new genes are represented by more than one EST, which, in most cases, demonstrates more than one site of expression. This study also provides detailed expression data for previously known human genes, most of which are represented by ESTs. A key result is that most highly expressed genes are expressed at high levels only in one or a few tissues; most ubiquitously expressed genes appear to be expressed at only moderate to low levels.
The availability of large numbers of EST sequences from humans and other organisms may largely obviate the problem of identifying protein-coding genes in anonymous genomic DNA sequences [27]. EST sequences will also add value to PCR and hybridization-based procedures for differential display of transcripts from distinct sources [28,29]. Given the ability to rapidly sequence PCR products directly, fragments of transcripts isolated using differential display techniques can be rapidly compared with EST data sets. The availability of large numbers of EST sequences also raises the possibility of highly directed hybridization-chip methods [30,31•] for locating identified mRNAs in extracts from multiple cell types or tissues.
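The tissue-level observations above rest on tallying ESTs by gene and by source library. The following is a minimal sketch of such a tally, not the method used in the study cited: it assumes each EST has already been assigned to a gene and that each library is annotated with a tissue. The record layout, the function names, and all identifiers in `ESTS` are hypothetical and invented for illustration. EST counts are at best a rough proxy for transcript abundance, particularly when libraries have been normalized or subtracted.

```python
from collections import Counter, defaultdict

# Hypothetical EST records: (est_id, gene_id, tissue).
# All values are invented for illustration; none come from a real library.
ESTS = [
    ("est001", "geneA", "brain"),
    ("est002", "geneA", "brain"),
    ("est003", "geneA", "brain"),
    ("est004", "geneB", "brain"),
    ("est005", "geneB", "liver"),
    ("est006", "geneB", "heart"),
    ("est007", "geneC", "liver"),
]


def expression_profile(ests):
    """Count ESTs per gene per tissue."""
    profile = defaultdict(Counter)
    for _est_id, gene, tissue in ests:
        profile[gene][tissue] += 1
    return profile


def summarize(profile):
    """Map each gene to (total ESTs, tissues observed, fraction in top tissue)."""
    summary = {}
    for gene, counts in profile.items():
        total = sum(counts.values())
        top = max(counts.values())
        summary[gene] = (total, len(counts), top / total)
    return summary


if __name__ == "__main__":
    for gene, (total, n_tissues, top_frac) in sorted(
        summarize(expression_profile(ESTS)).items()
    ):
        print(f"{gene}: {total} ESTs, {n_tissues} tissue(s), "
              f"{top_frac:.2f} of ESTs in top tissue")
```

In this toy data set, a gene with several ESTs drawn from a single tissue (the hypothetical `geneA`) has the tissue-restricted pattern described above, whereas a gene seen once in each of several tissues (`geneB`) has the broad, low-level pattern.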
Regulation of gene expression
Model organisms with well defined genetic systems provide a key resource for characterizing the expression, function, and cellular roles of transcription factors, signal transducers, and other molecules involved in regulating gene expression. The cloning and characterization of genes encoding such molecules in model eukaryotes often precedes the identification of mammalian homologs. Increasingly, the latter are identified by sequence similarity searches using EST or other mammalian sequences. The initial EST study of Adams et al. [1], for example, identified human homologs of Notch and Enhancer of Split, and subsequent EST projects have identified new human genes encoding hundreds of additional regulatory proteins (MD Adams et al., unpublished data). A human homolog of trithorax was identified by sequencing fragments of genomic clones to construct sequence tagged sites [32].
The identification of homologous regulatory genes that display homologous roles across large phylogenetic distances attests to the power of the EST technique and provides a guiding conceptual framework to studies of gene expression. It is now clear, for example, that homeobox-containing transcription factors play similar roles in establishing body plans throughout the metazoa [33•,34•]. The conservation of complex regulatory pathways across broad ranges of organisms has also emerged as a theme in both developmental biology and physiology. The characterization of the hedgehog system in mammals [35•] and other vertebrates, and the further elucidation of the ras pathway [36•] are recent examples.
Concomitant with the identification and functional characterization of individual regulatory molecules and the genes encoding them has been the rapid advance in the characterization of upstream transcription factor binding sites and the structures of their complexes with DNA. Work on a wide variety of genes has established that gene expression, especially the cell- and stage-specific expression of genes encoding regulatory molecules, is typically controlled by multiple factors binding multiple sites and imposing a complex geometry on DNA [37]. Co-crystallization studies have allowed the derivation of structures of protein-DNA complexes for several transcription factor families [38•].
Organizing and correlating gene expression data
Challenging data on gene expression have generally been available only through the traditional scientific literature. With the advent of high-throughput EST sequencing, with its reliance on complex laboratory data management systems that couple data acquisition and analysis (AR Kerlavage et al., Proceedings of the 26th Annual Hawaii International Conference on Systems Sciences, Hawaii, 1993, pp 585-594), it has become clear that a new generation of databases capable of representing gene expression data is needed.
A specific EST database, dbEST, has been established at the United States National Library of Medicine [25]. Beginning with the publication of data on human brain genes [3], EST sequence, expression, similarity, and mapping data have been made available in a more structured form. A new Human cDNA Database (HCD) at The Institute for Genomic Research (TIGR) supports queries for human EST sequence, assembly, analysis, expression and other data (MD Adams et al., unpublished data). As the volume and complexity of data on mammalian gene expression continue to multiply, databases of progressively higher functionality will be required to organize and interpret them.
Conclusions
EST sequencing is enabling both the identification of genes and the characterization of their expression patterns to be accomplished with staggering speed. Thus, it is possible that the basic data required for systematic analyses of regulatory pathways might be available for both humans and a variety of model systems within the next few years. Understanding these data will require a shift from the current methodology of analyzing one or a few genes at a time, to a methodology directed toward the characterization of regulatory mechanisms in complex pathways involving many genes. Techniques for assaying gene expression and action at progressively finer spatial and temporal scales that maintain the high throughputs typical of EST analysis will also be needed. As these techniques mature and are widely applied, it will be possible to attain a much deeper understanding of cellular differentiation and function, and of their perturbation in cancers, infection, and other pathological conditions. Such an understanding will provide a rational basis for diagnosis and therapy with both pharmacological and genetic approaches.
References and recommended reading
Papers of particular interest, published within the annual period of review, have been highlighted as:
• of special interest
•• of outstanding interest
1. Adams MD, Kelley JM, Gocayne JD, Dubnick M, Polymeropoulos MH, Xiao H, Merril CR, Wu A, Olde B, Moreno RF et al.: Complementary DNA sequencing: expressed sequence tags and the human genome project. Science 1991, 252:1651-1656.
2. Matsubara K, Okubo K: Identification of new genes by systematic analysis of cDNAs and database construction. Curr Opin Biotechnol 1993, 4:672-677.
3. Adams MD, Dubnick M, Kerlavage AR, Moreno R, Kelley JM, Utterback TR, Nagle JW, Fields C, Venter JC: Sequence identification of 2375 human brain genes. Nature 1992, 355:632-634.
4. Okubo K, Hori N, Matoba R, Niiyama T, Fukushima A, Kojima Y, Matsubara K: Large scale cDNA sequencing for analysis of quantitative and qualitative aspects of gene expression. Nature Genet 1992, 2:173-179.
5. Khan A, Wilcox A, Polymeropoulos M, Hopkins J, Stevens T, Robinson M, Orpana A, Sikela J: Single pass sequencing and physical and genetic mapping of human brain cDNAs. Nature Genet 1992, 2:180-185.
6. Geiser L, Swaroop A: Expressed sequence tags and chromosomal localization of cDNA clones from a subtracted retinal pigment epithelial library. Genomics 1992, 13:873-876.
7. Adams M, Kerlavage AR, Fields C, Venter JC: 3,400 new expressed sequence tags identify diversity of transcripts in human brain. Nature Genet 1993, 4:256-267.
8. Adams M, Soares MB, Kerlavage AR, Fields C, Venter JC: Rapid cDNA sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA library. Nature Genet 1993, 4:373-380.
9. Takeda J, Yano H, Eng S, Zeng Y, Bell G: A molecular inventory of human pancreatic islets: sequence analysis of 1000 cDNA clones. Hum Mol Genet 1993, 2:1793-1798.
10. Liew CC: A human heart cDNA library: the development of an efficient and simple method for automated DNA sequencing. Cardiology 1993, 25:891-894.
11. Matsubara K, Okubo K: cDNA analysis in the human genome project. Gene 1993, 135:265-274.
12. Affara NA, Bentley E, Davey P, Pelmear A, Jones MH: The identification of novel gene sequences of the human adult testis. Genomics 1994, 22:205-210.
13. Hoog C: Isolation of a large number of novel mammalian genes by a differential cDNA library searching strategy. Nucleic Acids Res 1991, 19:6123-6127.
14. Starborg M, Brundell E, Hoog C: Analysis of the expression of a large number of novel genes isolated from mouse prepubertal testis. Mol Reprod Dev 1992, 33:243-251.
15. Collet C, Joseph R: Characterization of "expressed sequence tags" from a marsupial mammary gland cDNA library. Biochem Genet 1994, 32:181-188.
16. Waterston RH, Martin C, Craxton M, Huynh C, Coulson A, Hillier L, Durbin R, Green P et al.: A survey of expressed genes in Caenorhabditis elegans. Nature Genet 1992, 1:114-123.
17. McCombie WR, Adams M, Kelley J, Fitzgerald M, Utterback T, Khan M, Dubnick M, Kerlavage AR, Venter JC, Fields C: Caenorhabditis elegans expressed sequence tags identify gene families and disease gene homologues. Nature Genet 1992, 1:124-131.
18. Uchimiya H, Kidou S, Shimazaki T, Aotsuka S, Takamatsu S, Nishi R, Hashimoto H, Matsubayashi Y, Kidou N, Umeda M, Kato A: Random sequencing of cDNA libraries reveals a variety of expressed genes in cultured cells of rice (Oryza sativa L). Plant J 1992, 2:1005-1009.
19. Höfte H, Desprez T, Amselem J, Chiapello H, Caboche M, Moisan A, Jourjon M, Charpenteau J, Berthomieu P, Guerrier D et al.: An inventory of 1152 expressed sequence tags obtained by partial sequencing of cDNAs from Arabidopsis thaliana. Plant J 1993, 4:1051-1061.
20. Keith C, Hoang D, Barrett B, Feigelman B, Nelson M, Thai H, Baysdorfer C: Partial sequence analysis of 130 randomly selected maize cDNA clones. Plant Physiol 1993, 101:329-332.
21. Park Y, Kwak J, Kwon O, Kim Y, Lee D, Cho M, Lee H, Nam HG: Generation of expressed sequence tags of random root cDNA clones of Brassica napus by single-run partial sequencing. Plant Physiol 1993, 103:359-370.
22. Kim C, Markiewicz P, Lee J, Schierle C, Miller J: Studies of the hyperthermophile Thermotoga maritima by random sequencing of cDNA and genomic libraries: identification and sequencing of the trpEG (D) operon. J Mol Biol 1993, 231:960-981.
23. Polymeropoulos MH, Xiao H, Sikela J, Adams M, Venter JC, Merril CR: Chromosomal distribution of 320 genes from a brain cDNA library. Nature Genet 1993, 4:381-386.
24. Durkin AS, Maglott DR, Nierman WC: Chromosomal assignment of 38 human brain expressed sequence tags (ESTs) by analyzing fluorescently labeled PCR products from hybrid cell panels. Cytogenet Cell Genet 1994, 65:86-91.
25. Boguski MS, Lowe TMJ, Tolstoshev CM: dbEST--database for "expressed sequence tags." Nature Genet 1993, 4:332-333.
26. Fields C, Adams MD, White O, Venter JC: How many genes in the human genome? Nature Genet 1994, 7:345-346.
27. Fields C: Integrating computational and experimental methods for gene discovery. In Automated DNA sequencing and analysis. Edited by Adams MD, Fields C, Venter JC. London: Academic Press; 1994:321-325.
28. Welsh J, Chada K, Dalal SS, Ralph D, Cheng R, McClelland M: Arbitrary primed PCR fingerprinting of RNA. Nucleic Acids Res 1992, 20:4965-4970.
29. Liang P, Pardee AB: Differential display of eukaryotic messenger RNA by means of the polymerase chain reaction. Science 1992, 257:967-971.
30. Drmanac R, Drmanac S, Strezoska Z, Paunesku T, Labat I, Zeremski M, Snoddy J, Funkhouser W, Koop B, Hood L, Crkvenjakov R: DNA sequence determination by hybridization: a strategy for efficient large-scale sequencing. Science 1993, 260:1649-1652.
31. • Pease AC, Solas D, Sullivan EJ, Cronin MT, Holmes CP, Fodor SPA: Light-generated oligonucleotide arrays for rapid DNA sequence analysis. Proc Natl Acad Sci USA 1994, 91:5022-5026.
This paper describes a working method for fabrication of high-density oligonucleotide hybridization chips. Although it focuses on the technology and its potential diagnostic applications, a similar scheme should also be applicable to expression analysis.
32. Djabali M, Selleri L, Parry P, Bower M, Young B, Evans G: A trithorax-like gene is interrupted by chromosome 11q23 translocations in acute leukaemias. Nature Genet 1992, 2:113-118.
33. • Kenyon C: If birds can fly, why can't we? Homeotic genes and evolution. Cell 1994, 78:175-180.
An excellent and synoptic review of homeobox gene evolution spanning the metazoa. The discovery that homeobox genes exist in nematodes, and thus are not restricted to animals with segmented body plans, has revolutionized thinking concerning the roles of these genes in development.
34. • Krumlauf R: Hox genes in vertebrate development. Cell 1994, 78:191-201.
A thorough review of vertebrate homeobox gene function. Most of the relevant work has been done in mice.
35. • Echelard Y, Epstein D, St-Jacques B, Shen L, Mohler J, McMahon J, McMahon A: Sonic hedgehog, a member of a family of putative signaling molecules, is implicated in the regulation of CNS polarity. Cell 1993, 75:1417-1430.
This paper reports the initial characterization of the hedgehog system in mammals. Accompanying papers by Riddle et al. and Krauss et al. in the same issue of Cell describe homologous genes in chickens and zebrafish.
36. • Dickson B, Hafen E: Genetics of signal transduction in invertebrates. Curr Opin Genet Dev 1994, 4:64-70.
Reviews the genetic reconstruction of ras pathways in model organisms. One of several articles (including a review by McCormick on ras function) in this issue of Current Opinion in Genetics and Development that are relevant to gene expression studies.
37. Tjian R, Maniatis T: Transcriptional activation: a complex puzzle with few easy pieces. Cell 1994, 77:5-8.
38. • Burley SK: DNA-binding motifs from eukaryotic transcription factors. Curr Opin Struct Biol 1994, 4:3-11.
Reviews recent work on the structures of DNA-transcription factor complexes. These data are key to understanding changes in DNA geometry imposed by binding. (This issue of Current Opinion in Structural Biology contains several reviews summarizing recent results from studies of protein-DNA interactions.)
C Fields, National Center for Genome Resources, 1800 Old Pecos Trail, Santa Fe, New Mexico 87505, USA.